Oracle® Big Data Connectors User's Guide
Release 1 (1.1)

Part Number E36049-03

5 Oracle R Connector for Hadoop

This chapter describes R support for big data. It contains the following sections:

5.1 About Oracle R Connector for Hadoop

Oracle R Connector for Hadoop is an R package that provides an interface between the local R environment and Apache Hadoop. You install and load this package as you would any other R package. Using simple R functions, you can copy data between R memory, the local file system, and HDFS. You can schedule R programs to execute as Hadoop MapReduce jobs and return the results to any of those locations.

To use Oracle R Connector for Hadoop, you should be familiar with R programming and statistical methods.

See Also:

To get started using R, see the R Project website at

http://www.r-project.org/

5.1.1 Oracle R Connector for Hadoop APIs

Oracle R Connector for Hadoop provides access from a local R client to Apache Hadoop using these APIs:

  • hadoop: Provides an interface to Hadoop MapReduce.

  • hdfs: Provides an interface to HDFS.

  • orch: Provides an interface between the local R instance and Oracle Database.

All of the APIs are included in the ORCH library. The functions are listed in this chapter in alphabetical order.
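
For example, loading the ORCH package in an R session makes all three groups of functions available. The following is a minimal sketch; the version number and working directory shown in the output are taken from examples elsewhere in this chapter and will differ on your system.

R> library(ORCH)        # loads the hadoop.*, hdfs.*, and orch.* functions
R> orch.version()       # confirm the package version
[1] "0.1.8"
R> hdfs.pwd()           # confirm connectivity to HDFS
[1] "/user/oracle/"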

5.1.2 Access to Oracle Database

A separate product, Oracle R Enterprise, provides direct read and write access to Oracle Database and enables you to perform statistical analysis on database tables, views, and other data objects. Access to the data stored in an Oracle database is always restricted to the access rights granted by your DBA.

After analyzing the unstructured data in HDFS using Oracle R Connector for Hadoop, you can store the results in an Oracle database using Oracle R Enterprise. You can then perform additional analysis on this smaller set of data.

Oracle R Enterprise is included in the Oracle Advanced Analytics option to Oracle Database Enterprise Edition. It is not one of the Oracle Big Data Connectors.

5.2 Scenarios for Using Oracle R Packages

The following scenario may help you identify opportunities for using Oracle R Connector for Hadoop with Oracle R Enterprise.

Using Oracle R Connector for Hadoop, you might look for files that you have access to on HDFS and then schedule R calculations to execute on data in one such file. Furthermore, you can upload data stored in text files on your local file system into HDFS for calculations, schedule an R script for execution on the Hadoop cluster, and download the results into a local file.
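
The following hedged sketch illustrates that workflow using functions documented later in this chapter. The file names weblog.dat and results.dat, and the mapper and reducer bodies, are placeholders.

dfs <- hdfs.upload('weblog.dat', dfs.name='weblog_File', header=TRUE)   # copy a local text file into HDFS
res <- hadoop.run(dfs,
    mapper  = function(k, v) { orch.keyval(k, v) },                     # placeholder map logic
    reducer = function(k, vals) { orch.keyval(k, length(vals)) })       # placeholder reduce logic
hdfs.download(res, '/home/oracle/results.dat', overwrite=TRUE)          # copy the results to a local file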

Using Oracle R Enterprise, you can open the R interface and connect to Oracle Database to work on the tables and views that are visible based on your database privileges. You can filter out rows, add derived columns, project new columns, and perform visual and statistical analysis.

Again using Oracle R Connector for Hadoop, you might deploy a MapReduce job on Hadoop for CPU-intensive calculations written in R. The calculation can use data stored in HDFS or, with Oracle R Enterprise, in an Oracle database. You can return the output of the calculation to an Oracle database and to the R console for visualization or additional processing.

5.3 Security Notes for Oracle R Connector for Hadoop

Oracle R Connector for Hadoop invokes the Sqoop utility to connect to Oracle Database either to extract data or to store results. Sqoop is a command-line utility for Hadoop that imports and exports data between HDFS or Hive and structured databases. The name Sqoop comes from "SQL to Hadoop."

The following explains how Oracle R Connector for Hadoop stores a database user password and sends it to Sqoop.

Oracle R Connector for Hadoop stores a user password only when the user establishes the database connection in a mode that does not require reentering the password each time. The password is stored encrypted in memory. See orch.connect.

Oracle R Connector for Hadoop generates a configuration file for Sqoop and uses it to invoke Sqoop locally. The file contains the user's database password, obtained either by prompting the user or from the encrypted in-memory representation. The file has local user access permissions only. The file is created, the permissions are set explicitly, and only then is the file opened for writing and filled with data.

Sqoop uses the configuration file to generate custom JAR files dynamically for the specific database job and passes the JAR files to the Hadoop client software. The password is stored inside the compiled JAR file; it is not stored in plain text.

The JAR file is transferred to the Hadoop cluster over a network connection. The network connection and the transfer protocol are specific to Hadoop (for example, port 5900).

The configuration file is deleted after Sqoop finishes compiling its JAR files and starts its own Hadoop jobs.

5.4 Functions in Alphabetical Order

hadoop.exec
hadoop.run
hdfs.attach
hdfs.cd
hdfs.cp
hdfs.describe
hdfs.download
hdfs.exists
hdfs.get
hdfs.id
hdfs.ls
hdfs.mkdir
hdfs.mv
hdfs.parts
hdfs.pull
hdfs.push
hdfs.put
hdfs.pwd
hdfs.rm
hdfs.rmdir
hdfs.root
hdfs.sample
hdfs.setroot
hdfs.size
hdfs.upload
is.hdfs.id
orch.connect
orch.dbcon
orch.dbg.off
orch.dbg.on
orch.dbg.output
orch.dbinfo
orch.disconnect
orch.dryrun
orch.export
orch.keyval
orch.keyvals
orch.pack
orch.reconnect
orch.unpack
orch.version

5.5 Functions by Category

The functions are grouped into these categories:

5.5.4 Writing MapReduce Functions

orch.dryrun
orch.keyval
orch.keyvals

5.5.6 Executing Scripts

hadoop.exec
hadoop.run
orch.dryrun

5.6 ORCH mapred.config Class

The hadoop.exec and hadoop.run functions have an optional argument, config, for configuring the resultant MapReduce job. This argument is an instance of the mapred.config class.

The mapred.config class has these slots:

hdfs.access

Set to TRUE to allow access to the hdfs.* functions in the mappers, reducers, and combiners, or set to FALSE to restrict access (default).

job.name

A descriptive name for the job so that you can monitor its progress more easily.

map.input

The mapper input data type: data.frame, list, or vector (default).

map.output

The name of a sample data frame that the mapper generates, which defines the output structure from the mappers. This data frame helps the reducers to decipher the metadata of the files generated by the mappers.

map.split

The number of data values given at one time to the mapper:

  • 0 sends all the values at one time.

  • 1 sends one row only to the mapper at a time.

  • n sends a minimum of n rows to the mapper at a time. In this syntax, n is an integer greater than 1. Some algorithms require a minimum number of rows to function.

map.tasks

The number of mappers to run. Specify 1 to run the mappers sequentially; specify a larger integer to run the mappers in parallel.

map.valkey

Set to TRUE to duplicate the keys as data values for the mapper, or FALSE to use the keys only as keys (default).

min.split.size

The minimum number of rows to send to the mapper at one time.

reduce.input

The reducer input data type: data.frame or list (default).

reduce.output

The name of a sample data frame that defines the output structure from the reducers. The reducer generates this data frame, which is used to interpret the metadata of the final file it generates.

reduce.split

The number of data values given at one time to the reducer. See the values for map.split.

reduce.tasks

The number of reducers to run. Specify 1 to run the reducers sequentially; specify a larger integer to run the reducers in parallel.

reduce.valkey

Set to TRUE to duplicate the keys as data values for the reducer, or FALSE to use the keys only as keys (default).

verbose

Set to TRUE to generate diagnostic information, or FALSE otherwise.
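
As an illustration, the following hedged fragment constructs a mapred.config object; the slot values are arbitrary, and only a subset of the slots is set.

cfg <- new("mapred.config",
    job.name     = "arrival-delay",   # descriptive name for monitoring the job
    map.split    = 1,                 # send one row at a time to each mapper
    map.tasks    = 4,                 # run four mappers in parallel
    reduce.tasks = 2,                 # run two reducers in parallel
    verbose      = TRUE)              # generate diagnostic information

You can then pass cfg as the config argument of hadoop.run or hadoop.exec.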

5.7 Example R Programs Using Oracle R Connector for Hadoop

The ORCH package includes sample code to help you learn to adapt your R programs to run on a Hadoop cluster using Oracle R Connector for Hadoop. This topic describes these examples and demonstrations.

5.7.1 Using the Examples

The examples show how you use the Oracle R Connector for Hadoop API. You can view them in a text editor after extracting them from orch.tgz. They are located in the orch/inst/examples directory.

Example R Programs 

example-debug.R

Shows how to use key-value pairs, enable debugging, and check for errors. The mapper function selects only the SFO airport data from the ONTIME_S data set, and the reducer calculates the mean arrival delay.

This program requires Oracle R Enterprise for the ore.pull function. It uses the following ORCH functions:

hadoop.run
hdfs.get
hdfs.put
orch.dbg.on
orch.dbg.output
orch.keyval
example-filter1.R

Shows how to use key-value pairs. The mapper function selects cars with a distance value greater than 30 from the cars data set, and the reducer function calculates the mean distance for each speed.

This program uses the following ORCH functions:

hadoop.run
hdfs.get
hdfs.put
orch.keyval
example-filter2.R

Shows how to use values only. The mapper function selects cars with a distance greater than 30 and a speed greater than 14 from the cars data set, and the reducer function calculates the mean speed and distance as one value pair.

This program uses the following ORCH functions:

hadoop.run
hdfs.get
hdfs.put
orch.keyval
example-filter3.R

Shows how to load a local file into HDFS. The mapper and reducer functions are the same as example-filter2.R.

This program uses the following ORCH functions:

hadoop.run
hdfs.download
hdfs.upload
orch.keyval
example-group.apply.R

Shows how to build a parallel model and generate a graph. The mapper partitions the data based on the petal lengths in the iris data set, and the reducer uses basic R statistics and graphics functions to fit the data into a linear model and plot a graph.

This program uses the following ORCH functions:

hadoop.run
hdfs.download
hdfs.exists
hdfs.get
hdfs.id
hdfs.mkdir
hdfs.put
hdfs.rmdir
hdfs.upload
orch.pack
orch.export
orch.unpack
example-kmeans.R

Shows how to run a simple k-means clustering analysis in Hadoop. This program defines a k-means clustering function and generates random points for a clustering test. The results are printed or graphed.

This program uses the following ORCH functions:

hadoop.exec
hdfs.get
hdfs.put
orch.export
example-lm.R

Shows how to define multiple mappers and one reducer that merges all results. The program calculates a linear regression using the iris data set.

This program uses the following ORCH functions:

hadoop.run
hdfs.get
hdfs.put
orch.export
orch.pack
orch.unpack
example-lmqr.R

Shows more complex analysis, mappers, and reducers for calculating a linear regression using the iris data set. The values are computed in three phases by executing three MapReduce jobs.

This program uses the following ORCH functions:

hadoop.run
hdfs.get
hdfs.put
orch.export
orch.pack
orch.unpack
example-logreg.R

Performs a one-dimensional, logistic regression on the cars data set.

This program uses the following ORCH functions:

hadoop.run
hdfs.put
orch.export
example-map.df.R

Shows how to run the mapper with an unlimited number of records input at one time as a data frame. The mapper selects cars with a distance greater than 30 from the cars data set and calculates the mean distance. The reducer merges the results.

This program uses the following ORCH functions:

hadoop.run
hdfs.get
hdfs.put
example-map.list.R

Shows how to run the mapper with an unlimited number of records input at one time as a list. The mapper selects cars with a distance greater than 30 from the cars data set and calculates the mean distance. The reducer merges the results.

This program uses the following ORCH functions:

hadoop.run
hdfs.get
hdfs.put
example-model.plot.R

Shows how to create models and graphs using HDFS. The mapper provides key-value pairs from the iris data set to the reducer. The reducer creates a linear model from data extracted from the data set, plots the results, and saves them in an HDFS file.

This program uses the following ORCH functions:

hadoop.run
hdfs.download
hdfs.exists
hdfs.get
hdfs.id
hdfs.mkdir
hdfs.put
hdfs.rmdir
hdfs.upload
orch.export
orch.pack
example-model.prep.R

Shows how to distribute data across several map tasks. The mapper generates a data frame from a slice of input data from the iris data set. The reducer merges the data frames into one output data set.

This program uses the following ORCH functions:

hadoop.exec
hdfs.get
hdfs.put
orch.export
orch.keyvals
example-rlm.R

Shows how to convert a simple R program into one that can run as a MapReduce job on a Hadoop cluster. In this example, the program calculates and graphs a linear model on the cars data set using basic R functions.

This program uses the following ORCH functions:

hadoop.run
orch.keyvals
orch.unpack
example-split.map.R

Shows how to split the data in the mapper. The first job runs the mapper in list mode and splits the list in the mapper. The second job splits a data frame in the mapper. Both jobs use the cars data set.

This program uses the following ORCH functions:

hadoop.run
hdfs.get
hdfs.put
orch.keyval
example-split.reduce.R

Shows how to split the data from the cars data set in the reducer.

This program uses the following ORCH functions:

hadoop.run
hdfs.get
hdfs.put
orch.keyval
example-sum.R

Shows how to perform a sum operation in a MapReduce job. The first job sums a vector of numeric values, and the second job sums all columns of a data frame.

This program uses the following ORCH functions:

hadoop.run
orch.keyval
example-teragen.matrix.R

Shows how to generate large data sets in a matrix for testing programs in Hadoop. The mappers generate samples of random data, and the reducers merge them.

This program uses the following ORCH functions:

hadoop.run
hdfs.put
orch.export
orch.keyvals
example-teragen.xy.R

Shows how to generate large data sets in a data frame for testing programs in Hadoop. The mappers generate samples of random data, and the reducers merge them.

This program uses the following ORCH functions:

hadoop.run
hdfs.put
orch.export
orch.keyvals
example-teragen2.xy.R

Shows how to generate large data sets in a data frame for testing programs in Hadoop. One mapper generates small samples of random data, and the reducers merge them.

This program uses the following ORCH functions:

hadoop.run
hdfs.put
orch.export
orch.keyvals
example-terasort.R

Provides an example of a TeraSort job on a set of randomly generated values.

This program uses the following ORCH function:

hadoop.run

5.7.2 Using the Demos

The demos illustrate various analytic techniques that use the Oracle R Connector for Hadoop API. You can view them in a text editor after extracting them from orch.tgz. They are located in the orch/inst/demos directory.

Demo R Programs 

demo-bagged.clust.R

Provides an example of bagged clustering using randomly generated values. The mappers perform k-means clustering analysis on a subset of the data and generate centroids for the reducers. The reducers combine the centroids into a hierarchical cluster and store the hclust object in HDFS.

This program uses the following ORCH functions:

hadoop.exec
hdfs.put
is.hdfs.id
orch.export
orch.keyval
orch.pack
orch.unpack
demo-distance.R

Calculates a similarity matrix using a Euclidian distance metric. This program uses the MovieLens data set, which you can download from

http://grouplens.org/node/73/

This program uses the following ORCH functions:

hadoop.exec
hdfs.put
hdfs.sample
is.hdfs.id
orch.keyval
orch.pack
orch.unpack
demo-kmeans.R

Performs k-means clustering. You must provide this demo with two parameters: an ore-frame object and a matrix of cluster centers. You must have Oracle R Enterprise to create an ore-frame object.

This program uses the following ORCH functions:

hadoop.exec
hdfs.get
orch.export
orch.keyval
orch.pack
orch.unpack
demo-pca.R

Calculates the statistics for the principal components analysis (PCA) of a data set. You must provide this demo with a parameter for the file name of the data set. This program creates a tree of multiple MapReduce jobs, which reduce the number of records at each stage. The final MapReduce job performs the final merge operation and generates the statistics for the entire data set.

This program uses the following ORCH functions:

hadoop.run
orch.export
orch.keyval
orch.pack
orch.unpack
demo-pearson.R

Calculates a similarity matrix using Pearson's correlation metric. This program uses the MovieLens data set, which you can download from

http://grouplens.org/node/73/

This program uses the following ORCH functions:

hadoop.exec
hdfs.put
is.hdfs.id
orch.keyval
orch.pack
orch.unpack

hadoop.exec

Starts the Hadoop engine and sends the mapper, reducer, and combiner R functions for execution. You must load the data into HDFS first.

Usage

hadoop.exec(
        dfs.id, 
        mapper, 
        reducer, 
        combiner, 
        export,
        init,
        final,
        job.name,
        config)

Arguments

dfs.id

The name of a file in HDFS containing data to be processed. The file name can include a path that is either absolute or relative to the current path.

mapper

Name of a mapper function written in the R language.

reducer

Name of a reducer function written in the R language (optional).

combiner

Name of a combiner function written in the R language (optional).

export

Names of exported R objects from your current R environment that are referenced by any of the mapper, reducer, or combiner functions (optional).

init

A function that is executed once before the mapper function begins (optional).

final

A function that is executed once after the reducer function completes (optional).

job.name

A descriptive name that you can use to track the progress of the MapReduce job instead of the automatically generated job name (optional).

config

Sets the configuration parameters for the MapReduce job (optional).

This argument is an instance of the mapred.config class, and thus it has this format:

config = new("mapred.config", param1, param2,...

See "ORCH mapred.config Class" for a description of this class.

Usage Notes

This function provides more control of the data flow than the hadoop.run function. You must use hadoop.exec when chaining several mappers and reducers in a pipeline, because the data does not leave HDFS. The results are stored in HDFS files.
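
For example, because hadoop.exec returns an HDFS object identifier, you can pass the output of one job directly as the input of the next. The following is a hedged sketch; the filter and count logic are placeholders modeled on the example later in this section.

dfs  <- hdfs.attach('ontime_R')
job1 <- hadoop.exec(dfs,
    mapper = function(key, ontime) {
        if (key == 'SFO') {
            keyval(key, ontime)          # first job: keep only the SFO rows
        }
    })
job2 <- hadoop.exec(job1,                # job1 is an HDFS object identifier; the data stays in HDFS
    mapper = function(key, val) {
        keyval(key, val)
    },
    reducer = function(key, vals) {
        keyval(key, length(vals))        # second job: count the SFO rows
    })
print(hdfs.get(job2))                    # copy the final results into local R memory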

Return Value

Data object identifier in HDFS

See Also

hadoop.run, orch.dryrun

Example

This sample script uses hdfs.attach to obtain the object identifier of a small, sample data file in HDFS named ontime_R.

The MapReduce job selects the flights arriving at San Francisco International Airport (SFO) and calculates their mean arrival delay.

dfs <- hdfs.attach('ontime_R')
res <- NULL
res <- hadoop.exec(
    dfs,
    mapper = function(key, ontime) {
        if (key == 'SFO') {
            keyval(key, ontime)
        }
    },
    reducer = function(key, vals) {
        sumAD <- 0
        count <- 0
        for (x in vals) {
           if (!is.na(x$ARRDELAY)) {sumAD <- sumAD + x$ARRDELAY; count <- count + 1}
        }
        res <- sumAD / count
        keyval(key, res)
    }
)

After the script runs, the res variable identifies the location of the results in an HDFS file named /user/oracle/xq/orch3d0b8218:

R> res
[1] "/user/oracle/xq/orch3d0b8218"
attr(,"dfs.id")
[1] TRUE
R> print(hdfs.get(res))
 val1     val2
1 SFO 27.05804

This code fragment is extracted from example-kmeans.R. The export argument makes the locally defined ncenters value available to the mapper function. The config argument sets the MapReduce job name to k-means.1 and supplies a sample data frame (mapf) that defines the mapper output format.

mapf <- data.frame(key=0, val1=0, val2=0)
dfs.points <- hdfs.put(points)
dfs.centers <- hadoop.exec(
    dfs.id = dfs.points,
    mapper = function(k, v) {
        keyval(sample(1:ncenters, 1), v)
    },
    reducer = function(k, vv) {
        vv <- sapply(vv, unlist)
        keyval(NULL, c(mean(vv[1,]), mean(vv[2,])))
    },
    export = orch.export(ncenters),
    config = new("mapred.config",
        job.name = "k-means.1",
        map.output = mapf)
)

hadoop.run

Starts the Hadoop engine and sends the mapper, reducer, and combiner R functions for execution. If the data is not already stored in HDFS, then hadoop.run first copies the data there.

Usage

hadoop.run(
        data, 
        mapper, 
        reducer, 
        combiner, 
        export,
        init,
        final,
        job.name,
        config)

Arguments

data

Data frame, Oracle R Enterprise frame (ore.frame), or an HDFS file descriptor.

mapper

Name of a mapper function written in the R language.

reducer

Name of a reducer function written in the R language (optional).

combiner

Name of a combiner function written in the R language (optional).

export

Names of exported R objects.

init

A function that is executed once before the mapper function begins (optional).

final

A function that is executed once after the reducer function completes (optional).

job.name

A descriptive name that you can use to track the progress of the job instead of the automatically generated job name (optional).

config

Sets the configuration parameters for the MapReduce job (optional).

This argument is an instance of the mapred.config class, and so it has this format:

config = new("mapred.config", param1, param2,...

See "ORCH mapred.config Class" for a description of this class.

Usage Notes

The hadoop.run function returns the results from HDFS to the source of the input data. For example, the results for HDFS input data are kept in HDFS, and the results for ore.frame input data are copied into an Oracle database.

Return Value

An object in the same format as the input data

See Also

hadoop.exec, orch.dryrun

Example

This sample script uses hdfs.attach to obtain the object identifier of a small, sample data file in HDFS named ontime_R.

The MapReduce job selects the flights arriving at San Francisco International Airport (SFO) and calculates their mean arrival delay.

dfs <- hdfs.attach('ontime_R')
res <- NULL
res <- hadoop.run(
    dfs,
    mapper = function(key, ontime) {
        if (key == 'SFO') {
            keyval(key, ontime)
        }
    },
    reducer = function(key, vals) {
        sumAD <- 0
        count <- 0
        for (x in vals) {
           if (!is.na(x$ARRDELAY)) {sumAD <- sumAD + x$ARRDELAY; count <- count + 1}
        }
        res <- sumAD / count
        keyval(key, res)
    }
)

After the script runs, the res variable identifies the location of the results in an HDFS file named /user/oracle/xq/orch3d0b8218:

R> res
[1] "/user/oracle/xq/orch3d0b8218"
attr(,"dfs.id")
[1] TRUE
R> print(hdfs.get(res))
 val1     val2
1 SFO 27.05804

hdfs.attach

Copies data from an unstructured data file in HDFS into the Oracle R Connector for Hadoop framework. By default, data files in HDFS are not visible to the connector. However, if you know the name of the data file, you can use this function to attach it to the Oracle R Connector for Hadoop name space.

Usage

hdfs.attach(
        dfs.name,
        force)

Arguments

dfs.name

The name of a file in HDFS.

force

Controls whether the function attempts to discover the structure of the file and the data type of each column.

FALSE for comma-separated value (CSV) files (default). If a file does not have metadata identifying the names and data types of the columns, then the function samples the data to deduce the data type as number or string. It then re-creates the file with the appropriate metadata.

TRUE for non-CSV files, including binary files. This setting prevents the function from trying to discover the metadata; instead, it simply attaches the file.

Usage Notes

Use this function to attach a CSV file to your R environment, just as you might attach a data frame.

Oracle R Connector for Hadoop does not support the processing of attached non-CSV files. Nonetheless, you can attach a non-CSV file, download it to your local computer, and use it as desired. Alternatively, you can attach the file for use as input to a Hadoop application.

Return Value

The object ID of the file in HDFS, or NULL if the operation failed

See Also

hdfs.download

Example

This example stores the object ID of ontime_R in a variable named dfs, and then displays its value.

R> dfs <- hdfs.attach('ontime_R')
R> dfs
[1] "/user/oracle/xq/ontime_R"
attr(,"dfs.id")
[1] TRUE

hdfs.cd

Sets the default HDFS path.

Usage

hdfs.cd(dfs.path)

Arguments

dfs.path

A path that is either absolute or relative to the current path.

Return Value

TRUE if the path is changed successfully, or FALSE if the operation failed

Example

This example changes the current directory from /user/oracle to /user/oracle/sample:

R> hdfs.cd("sample")
[1] "/user/oracle/sample"

hdfs.cp

Copies an HDFS file from one location to another.

Usage

hdfs.cp(
        dfs.src,
        dfs.dst,
        force)

Arguments

dfs.src

The name of the source file to be copied. The file name can include a path that is either absolute or relative to the current path.

dfs.dst

The name of the copied file. The file name can include a path that is either absolute or relative to the current path.

force

Set to TRUE to overwrite an existing file, or set to FALSE to display an error message (default).

Return Value

NULL for a successful copy, or FALSE for a failed attempt

Example

This example copies a file named weblog to the parent directory, overwriting any existing weblog file there:

R> hdfs.cp("weblog", "..", force=T)

hdfs.describe

Returns the metadata associated with a file in HDFS.

Usage

hdfs.describe(dfs.id)

Arguments

dfs.id

The name of a file in HDFS. The file name can include a path that is either absolute or relative to the current path.

Return Value

A data frame containing the metadata, or NULL if no metadata was available in HDFS

Example

This example provides information about an HDFS file named ontime_DB:

R> hdfs.describe('ontime_DB')
           name
1          path
2        origin
3         class
4         types
5           dim
6         names
7       has.key
8    key.column
9      null.key
10 has.rownames
11         size
12        parts

1                           values
2                        ontime_DB
3                          unknown
     .
     .
     .

hdfs.download

Copies a file from HDFS to the local file system.

Usage

hdfs.download(
        dfs.id,
        filename, 
        overwrite)

Arguments

dfs.id

The name of a file in HDFS. The file name can include a path that is either absolute or relative to the current path.

filename

The name of a file in the local file system where the data is copied.

overwrite

Controls whether the operation can overwrite an existing local file. Set to TRUE to overwrite filename, or FALSE to signal an error (default).

Usage Notes

This function provides the fastest and easiest way to copy a file from HDFS. No data transformations occur except merging multiple parts into a single file. The local file contains exactly the same data as the HDFS file.

Return Value

Local file name, or NULL if the copy failed

Example

This example displays a list of files in the current HDFS directory and copies ontime2000.DB to the local file system as /home/oracle/ontime2000.dat.

R> hdfs.ls()
[1] "ontime2000_DB" "ontime_DB"     "ontime_File"   "ontime_R"      "testdata.dat" 
R> tmpfile <- hdfs.download("ontime2000_DB", "/home/oracle/ontime2000.dat", overwrite=F)
R> tmpfile
[1] "/home/oracle/ontime2000.dat"

hdfs.exists

Verifies that a file exists in HDFS.

Usage

hdfs.exists(
        dfs.id)

Arguments

dfs.id

The name of a file in HDFS. The file name can include a path that is either absolute or relative to the current path.

Usage Notes

If this function returns TRUE, then you can attach the data and use it in a hadoop.run function. You can also use this function to validate an HDFS identifier and ensure that the data exists.
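
A minimal sketch of that pattern, assuming the ontime_R file name used elsewhere in this chapter and a placeholder mapper:

if (hdfs.exists("ontime_R")) {
    dfs <- hdfs.attach("ontime_R")       # attach only after confirming that the data exists
    res <- hadoop.run(dfs, mapper = function(k, v) { orch.keyval(k, v) })
}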

Return Value

TRUE if the identifier is valid and the data exists, or FALSE if the object is not found

See Also

is.hdfs.id

Example

This example shows that the ontime_R file exists.

R> hdfs.exists("ontime_R")
[1] TRUE

hdfs.get

Copies data from HDFS into a data frame in the local R environment. All metadata is extracted and all attributes, such as column names and data types, are restored if the data originated in an R environment. Otherwise, generic attributes like val1 and val2 are assigned.

Usage

hdfs.get(
        dfs.id,
        sep)

Arguments

dfs.id

The name of a file in HDFS. The file name can include a path that is either absolute or relative to the current path.

sep

The symbol used to separate fields in the file. A comma (,) is the default separator.

Usage Notes

If the HDFS file is small enough to fit into an in-memory R data frame, then you can copy the file using this function instead of the hdfs.pull function. The hdfs.get function can be faster, because it does not use Sqoop and thus does not have the overhead incurred by hdfs.pull.

Return Value

A data.frame object in the local R environment containing the data copied from HDFS, or NULL if the operation failed

Example

This example returns the contents of a data frame named res.

R> print(hdfs.get(res))
  val1      val2
1   AA 1361.4643
2   AS  515.8000
3   CO 2507.2857
4   DL 1601.6154
5   HP  549.4286
6   NW 2009.7273
7   TW 1906.0000
8   UA 1134.0821
9   US 2387.5000
10  WN  541.1538

hdfs.id

Converts an HDFS path name to an R dfs.id object.

Usage

hdfs.id(
        dfs.x,
        force)

Arguments

dfs.x

A string or text expression that resolves to an HDFS file name.

force

Set to TRUE if the file need not exist, or set to FALSE to ensure that the file does exist.

Return Value

A dfs.id object identifying the file, or NULL if a file by that name is not found and force is FALSE

Example

This example creates a dfs.id object for /user/oracle/demo:

R> hdfs.id('/user/oracle/demo')
[1] "user/oracle/demo"
attr(,"dfs.id")
[1] TRUE

The next example creates a dfs.id object named id for a nonexistent directory named /user/oracle/newdemo, after first failing:

R> id<-hdfs.id('/user/oracle/newdemo')
DBG: 16:11:38 [ER] "/user/oracle/newdemo" is not found
R> id<-hdfs.id('/user/oracle/newdemo', force=T)
R> id
[1] "user/oracle/newdemo"
attr(,"dfs.id")
[1] TRUE

hdfs.ls

Lists the names of all HDFS directories containing data in the specified path.

Usage

hdfs.ls(dfs.path)

Arguments

dfs.path

A path relative to the current default path. The default path is the current working directory.

Return Value

A list of data object names in HDFS, or NULL if the specified path is invalid

See Also

hdfs.cd

Example

This example lists the subdirectories in the current directory:

R> hdfs.ls()
[1] "ontime_DB"   "ontime_FILE"   "ontime_R"

The next example lists directories in the parent directory:

R> hdfs.ls("..")
[1] "demo"   "input"   "output"   "sample"   "xq"

This example returns NULL because the specified path is not in HDFS.

R> hdfs.ls("/bin")
NULL

hdfs.mkdir

Creates a subdirectory in HDFS relative to the current working directory.

Usage

hdfs.mkdir(
        dfs.name,
        cd)

Arguments

dfs.name

Name of the new directory.

cd

TRUE to change the current working directory to the new subdirectory, or FALSE to keep the current working directory (default).

Return Value

Full path of the new directory as a string, or NULL if the directory was not created

Example

This example creates the /user/oracle/sample directory.

R> hdfs.mkdir('sample', cd=T)
[1] "/user/oracle/sample"
attr(,"dfs.path")
[1] TRUE

hdfs.mv

Moves an HDFS file from one location to another.

Usage

hdfs.mv(
        dfs.src,
        dfs.dst,
        force)

Arguments

dfs.src

The name of the source file to be moved. The file name can include a path that is either absolute or relative to the current path.

dfs.dst

The name of the moved file. The file name can include a path that is either absolute or relative to the current path.

force

Set to TRUE to overwrite an existing destination file, or FALSE to cancel the operation and display an error message (default).

Return Value

NULL for a successful copy, or FALSE for a failed attempt

Example

This example moves a file named weblog to the demo subdirectory and overwrites the existing weblog file:

R> hdfs.mv("weblog", "./demo", force=T)

hdfs.parts

Returns the number of parts composing a file in HDFS.

Usage

hdfs.parts(
        dfs.id)

Arguments

dfs.id

The name of a file in HDFS. The file name can include a path that is either absolute or relative to the current path.

Usage Notes

HDFS splits large files into parts, which provide a basis for the parallelization of MapReduce jobs. The more parts an HDFS file has, the more mappers can run in parallel.

Return Value

The number of parts composing the object, or 0 if the object does not exist in HDFS

Example

This example shows that the ontime_R file in HDFS has one part:

R> hdfs.parts("ontime_R")
[1] 1

hdfs.pull

Copies data from HDFS into an Oracle database.

This operation requires authentication by Oracle Database. See orch.connect.

Usage

hdfs.pull(
        dfs.id,
        sep,
        db.name,
        overwrite,
        driver)

Arguments

dfs.id

The name of a file in HDFS. The file name can include a path that is either absolute or relative to the current path.

sep

The symbol used to separate fields in the file (optional). A comma (,) is the default separator.

db.name

The name of a table in an Oracle database (optional).

overwrite

Controls whether db.name can overwrite a table with the same name. Set to TRUE to overwrite the table, or FALSE to signal an error (default).

driver

Identifies the driver used to copy the data. This argument is currently ignored because Sqoop is the only supported driver.

Usage Notes

Because this operation is synchronous, copying a large data set may take a while. The prompt reappears and you regain use of R when copying is complete.

To copy large volumes of data into an Oracle database, consider using Oracle Loader for Hadoop. With the Oracle Advanced Analytics option, you can use Oracle R Enterprise to analyze the data.

Return Value

An ore.frame object that points to the database table with data loaded from HDFS, or NULL if the operation failed

See Also

Oracle R Enterprise User's Guide for a description of ore.frame objects.
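
Example

The following hedged fragment copies the HDFS file ontime_R into a database table. The table name ONTIME_FROM_HDFS is hypothetical, and a connection previously established with orch.connect is assumed.

ontime.df <- hdfs.pull('ontime_R',
                       db.name   = 'ONTIME_FROM_HDFS',   # hypothetical target table
                       overwrite = TRUE)                 # replace the table if it already exists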


hdfs.push

Copies data from an Oracle database to HDFS.

This operation requires authentication by Oracle Database. See orch.connect.

Note:

The Oracle R Enterprise library (ORE) must be attached for you to use this function.

Usage

hdfs.push(
        x,
        key,
        dfs.name,
        overwrite,
        driver,
        split.by)

Arguments

x

An ore.frame object with the data in an Oracle database to be pushed.

key

The index or name of the key column.

dfs.name

Unique name for the object in HDFS.

overwrite

TRUE to allow dfs.name to overwrite an object with the same name, or FALSE to signal an error (default).

driver

Identifies the driver used to copy the data. This argument is currently ignored because Sqoop is the only supported driver.

split.by

The column to use for data partitioning (optional).

Usage Notes

Because this operation is synchronous, copying a large data set may take a while. The prompt reappears and you regain use of R when copying is complete.

An ore.frame object is an Oracle R Enterprise metadata object that points to a database table. It corresponds to an R data.frame object.

Return Value

The full path to the file that contains the data set, or NULL if the operation failed

See Also

Oracle R Enterprise User's Guide

Example

This example creates an ore.frame object named ontime_s2000 that contains the rows from the ONTIME_S database table where the year equals 2000. Then hdfs.push uses ontime_s2000 to create /user/oracle/xq/ontime2000_DB in HDFS.

R> ontime_s2000 <- ONTIME_S[ONTIME_S$YEAR == 2000,]
R> class(ontime_s2000)
[1] "ore.frame"
attr(,"package")
[1] "OREbase"
R> ontime2000.dfs <- hdfs.push(ontime_s2000, key='DEST', dfs.name='ontime2000_DB')
R> ontime2000.dfs
[1] "/user/oracle/xq/ontime2000_DB"
attr(,"dfs.id")
[1] TRUE

hdfs.put

Copies data from an Oracle database to HDFS. Column names, data types, and other attributes are stored as metadata in HDFS.

Note:

The Oracle R Enterprise library (ORE) must be attached for you to use this function.

Usage

hdfs.put(
        data,
        key,
        dfs.name,
        overwrite,
        rownames)

Arguments

data

An ore.frame object in the local R environment to be copied to HDFS.

key

The index or name of the key column.

dfs.name

A unique name for the new file.

overwrite

Controls whether dfs.name can overwrite a file with the same name. Set to TRUE to overwrite the file, or FALSE to signal an error.

rownames

Set to TRUE to add a sequential number to the beginning of each line of the file, or FALSE otherwise.

Usage Notes

You can use hdfs.put instead of hdfs.push to copy data from ore.frame objects, such as database tables, to HDFS. The table must be small enough to fit in R memory; otherwise, the function fails. The hdfs.put function first reads all table data into R memory and then transfers it to HDFS. For a small table, this function can be faster than hdfs.push because it does not use Sqoop and thus does not have the overhead incurred by hdfs.push.

Return Value

The object ID of the new file, or NULL if the operation failed

Example

This example creates a file named /user/oracle/xq/testdata.dat with the contents of the dat data frame.

R> myfile <- hdfs.put(dat, key='DEST', dfs.name='testdata.dat')
R> print(myfile)
[1] "/user/oracle/xq/testdata.dat"
attr(,"dfs.id")
[1] TRUE

hdfs.pwd

Identifies the current working directory in HDFS.

Usage

hdfs.pwd()

Return Value

The current working directory, or NULL if your R environment is not connected to HDFS

Example

This example shows that /user/oracle is the current working directory.

R> hdfs.pwd()
[1] "/user/oracle/"

hdfs.rm

Removes a file or directory from HDFS.

Usage

hdfs.rm(
        dfs.id,
        force)

Arguments

dfs.id

The name of a file or directory in HDFS. The name can include a path that is either absolute or relative to the current path.

force

Controls whether a directory that contains files is deleted. Set to TRUE to delete the directory and all its files, or FALSE to cancel the operation (default).

Usage Notes

All object identifiers in Hadoop pointing to this data are invalid after this operation.

Return Value

TRUE if the data is deleted, or FALSE if the operation failed

Example

This example removes the file named data1.log in the current working HDFS directory:

R> hdfs.rm("data1.log")
[1] TRUE

hdfs.rmdir

Deletes a directory in HDFS.

Usage

hdfs.rmdir(
        dfs.name,
        force)

Arguments

dfs.name

Name of the directory in HDFS to delete. The directory can be an absolute path or relative to the current working directory.

force

Controls whether a directory that contains files is deleted. Set to TRUE to delete the directory and all its files, or FALSE to cancel the operation (default).

Usage Notes

This function deletes all data objects stored in the directory, which invalidates all associated object identifiers in HDFS.

Return Value

TRUE if the directory is deleted successfully, or FALSE if the operation fails

Example

R> hdfs.rmdir("mydata")
[1] TRUE

hdfs.root

Returns the HDFS root directory.

Usage

hdfs.root()

Return Value

A data frame with the full path of the HDFS root directory

Example

This example identifies /user/oracle as the root directory of HDFS.

R> hdfs.root()
[1] "/user/oracle"

hdfs.sample

Copies a random sample of data from a Hadoop file into an R in-memory object. Use this function to copy a small sample of the original HDFS data for developing the R calculation that you ultimately want to execute on the entire HDFS data set on the Hadoop cluster.

Usage

hdfs.sample(
        dfs.id,
        lines,
        sep)

Arguments

dfs.id

The name of a file in HDFS. The file name can include a path that is either absolute or relative to the current path.

lines

The number of lines to return as a sample. The default value is 1000 lines.

sep

The symbol used to separate fields in the Hadoop file. A comma (,) is the default separator.

Usage Notes

If the data originated in an R environment, then all metadata is extracted and all attributes are restored, including column names and data types. Otherwise, generic attribute names, like val1 and val2, are assigned.

Return Value

A data.frame object with the sample data set, or NULL if the operation failed

Example

This example displays the first three lines of the ontime_R file.

R> hdfs.sample("ontime_R", lines=3)
  YEAR MONTH MONTH2 DAYOFMONTH DAYOFMONTH2 DAYOFWEEK DEPTIME...
1 2000    12     NA         31          NA         7     1730...
2 2000    12     NA         31          NA         7     1752...
3 2000    12     NA         31          NA         7     1803...

hdfs.setroot

Sets the HDFS root directory.

Usage

hdfs.setroot(dfs.root)

Arguments

dfs.root

The full path of the root directory.

Usage Notes

Use hdfs.root to see the current root directory.

Return Value

None

Example

This example changes the HDFS root directory from /user/oracle to /user/oracle/demo.

R> hdfs.root()
[1] "/user/oracle"
R> hdfs.setroot("/user/oracle/demo")
R> hdfs.root()
[1] "/user/oracle/demo"

hdfs.size

Returns the size of a file in HDFS.

Usage

hdfs.size(
        dfs.id,
        units)

Arguments

dfs.id

The name of a file in HDFS. The file name can include a path that is either absolute or relative to the current path.

units

Specifies a unit of measurement for the return value:

  • KB (kilobytes)

  • MB (megabytes)

  • GB (gigabytes)

  • TB (terabytes)

  • PB (petabytes)

The unit defaults to bytes if you omit the argument or enter an unknown value.

Usage Notes

Use this interface to determine, for instance, whether you can copy the contents of an entire HDFS file into local R memory or a local file, or if you can only sample the data while creating a prototype of your R calculation.
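
For example, the following hedged sketch makes that decision; the 100 MB threshold is arbitrary.

if (hdfs.size("ontime_R", units="MB") < 100) {
    dat <- hdfs.get("ontime_R")                  # small enough to copy into local R memory
} else {
    dat <- hdfs.sample("ontime_R", lines=1000)   # otherwise prototype on a sample
}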

Return Value

Size of the object, or 0 if the object does not exist in HDFS

Example

This example returns a file size for ontime_R of 999,839 bytes.

R> hdfs.size("ontime_R")
[1] 999839

hdfs.upload

Copies a file from the local file system into HDFS.

Usage

hdfs.upload(
        filename,
        dfs.name, 
        overwrite,
        split.size,
        header)

Arguments

filename

Name of a file in the local file system.

dfs.name

Name of the new directory in HDFS.

overwrite

Controls whether dfs.name can overwrite a directory with the same name. Set to TRUE to overwrite the directory, or FALSE to signal an error (default).

split.size

Maximum number of bytes in each part of the Hadoop file (optional).

header

Indicates whether the first line of the local file is a header containing column names. Set to TRUE if it has a header, or FALSE if it does not (default).

A header enables you to extract the column names and reference the data fields by name instead of by index in your MapReduce R scripts.

Usage Notes

This function provides the fastest and easiest way to copy a file into HDFS. If the file is larger than split.size, then Hadoop splits it into two or more parts. The new Hadoop file gets a unique object ID, and each part is named part-0000x. Hadoop automatically creates metadata for the file.

Return Value

HDFS object ID for the loaded data, or NULL if the copy failed

See Also

hdfs.download, hdfs.get, hdfs.put

Example

This example uploads a file named ontime_s2000.dat into HDFS and shows the location of the file, which is stored in a variable named ontime.dfs_File.

R> ontime.dfs_File <- hdfs.upload('ontime_s2000.dat', dfs.name='ontime_File')
R> print(ontime.dfs_File)
[1] "/user/oracle/xq/ontime_File"

is.hdfs.id

Indicates whether an R object contains a valid HDFS file identifier.

Usage

is.hdfs.id(x)

Arguments

x

The name of an R object.

Return Value

TRUE if x is a valid HDFS identifier, or FALSE if it is not

See Also

hdfs.attach, hdfs.id

Example

This example shows that dfs contains a valid HDFS identifier, which was returned by hdfs.attach:

R> dfs <- hdfs.attach('ontime_R')
R> is.hdfs.id(dfs)
[1] TRUE
R> print(dfs)
[1] "/user/oracle/xq/ontime_R"
attr(,"dfs.id")
[1] TRUE

The next example shows that a valid file name passed as a string is not recognized as a valid file identifier:

R> is.hdfs.id('/user/oracle/xq/ontime_R')
[1] FALSE

orch.connect

Establishes a connection to Oracle Database.

Usage

orch.connect(
        host,
        user,
        sid,
        passwd,
        port, 
        secure,
        driver,
        silent)

Arguments

host

Host name or IP address of the server where Oracle Database is running.

user

Database user name.

sid

System ID (SID) for the Oracle Database instance.

passwd

Password for the database user. If you omit the password, you are prompted for it.

port

Port number for the Oracle Database listener. The default value is 1521.

secure

Authentication setting for Oracle Database:

  • TRUE: You must enter a database password each time you attempt to connect (default).

  • FALSE: You must enter a database password only once during a session. The encrypted password is kept in memory and used for subsequent connection attempts.

driver

Driver used to connect to Oracle Database (optional). Sqoop is the default driver.

silent

TRUE to suppress the prompts for missing host, user, password, port, and SID values, or FALSE to see them (default).

Usage Notes

Use this function when your analysis requires access to data stored in an Oracle database or to return the results to the database.

With an Oracle Advanced Analytics license for Oracle R Enterprise and a connection to Oracle Database, you can work directly with the data stored in database tables and pass processed data frames to R calculations on Hadoop.

You can reconnect to Oracle Database using the connection object returned by the orch.dbcon function.

Return Value

TRUE for a successful and validated connection, or FALSE for a failed connection attempt

See Also

orch.dbcon, orch.disconnect

Example

This example installs the ORCH library and connects to Oracle Database on the local system:

R> library(ORCH)
Oracle R Connector for Hadoop 0.1.8 (rev.102)
Hadoop 0.20.2-cdh3u3 is up
Sqoop 1.3.0-cdh3u3 is up
R> orch.connect("localhost", "RQUSER", "orcl")
Connecting ORCH to RDBMS via [sqoop]
    Host: localhost
    Port: 1521
    SID: orcl
    User: RQUSER
Enter password for [RQUSER]: password
Connected.
[1] TRUE

The next example uses a connection object to reconnect to Oracle Database:

R> conn<-orch.dbcon()
R> orch.disconnect()
Disconnected from a database.

R> orch.connect(conn)
Connecting ORCH to RDBMS via [sqoop]
    Host: localhost
    Port: 1521
    SID: orcl
    User: RQUSER
Enter password for [RQUSER]: password
Connected
[1] TRUE

orch.dbcon

Returns a connection object for the current connection to Oracle Database, excluding the authentication credentials.

Usage

orch.dbcon()

Return Value

A data frame with the connection settings for Oracle Database

Usage Notes

Use the connection object returned by orch.dbcon to reconnect to Oracle Database using orch.connect.

See Also

orch.connect

Example

This example shows how you can reconnect to Oracle Database using the connection object returned by orch.dbcon:

R> orch.connect('localhost', 'RQUSER', 'orcl')
Connecting ORCH to RDBMS via [sqoop]
    Host: localhost
    Port: 1521
    SID: orcl
    User: RQUSER
Enter password for [RQUSER]: password
Connected
[1] TRUE

R> conn<-orch.dbcon()
R> orch.disconnect()
Disconnected from a database.

R> orch.connect(conn)
Connecting ORCH to RDBMS via [sqoop]
    Host: localhost
    Port: 1521
    SID: orcl
    User: RQUSER
Enter password for [RQUSER]: password
Connected
[1] TRUE

The following shows the connection object returned by orch.dbcon in the previous example:

R> conn
Object of class "orch.dbcon"
data frame with 0 columns and 0 rows
Slot "ok":
[1] TRUE

Slot "host":
[1] "localhost"

Slot "port":
[1] 1521

Slot "sid":
[1] "orcl"

Slot "user":
[1] "RQUSER"

Slot "passwd":
[1] ""

Slot "secure":
[1] TRUE

Slot "drv":
[1] "sqoop"

orch.dbg.off

Turns off debugging mode.

Usage

orch.dbg.off()

Return Value

FALSE

See Also

orch.dbg.on

Example

This example turns off debugging:

R> orch.dbg.off()

orch.dbg.on

Turns on debugging mode.

Usage

orch.dbg.on(severity)

Arguments

severity

Identifies the type of error messages that are displayed. You can identify the severity by the number or by the case-insensitive keyword shown in Table 5-1.

Table 5-1 Debugging Severity Levels

Keyword     Number    Description
all         11        Return all messages.
critical    1         Return only critical errors.
error       2         Return all errors.
warning     3         Return all warnings.


Return Value

The severity level

See Also

orch.dbg.output, orch.dbg.off

Example

This example turns on debugging for all errors:

R> severe<-orch.dbg.on(severity<-2)
R> severe
[1] "ERROR" "2"

orch.dbg.output

Directs the output from the debugger.

Usage

orch.dbg.output(con)

Arguments

con

Identifies the stream where the debugging information is sent: stderr(), stdout(), or a file name.

Usage Notes

You must first turn on debugging mode before redirecting the output.

Return Value

The current stream

See Also

orch.dbg.on

Example

This example turns on debugging mode and sends the debugging information to stderr. The orch.dbg.output function returns a description of stderr.

R> orch.dbg.on('all')
R> err<-orch.dbg.output(stderr())
17:32:11 [SY] debug output set to "stderr"
R> print(err)
description          class     mode      text    opened    can read    can write
   "stderr"     "terminal"      "w"    "text"   "opened"       "no"        "yes"

The next example redirects the output to a file named debug.log:

R> err<-orch.dbg.output('debug.log')
17:37:45 [SY] debug output set to "debug.log"
R> print(err)
[1] "debug.log"

orch.dbinfo

Provides information about the current connection.

Usage

orch.dbinfo(dbcon)

Arguments

dbcon

An Oracle Database connection object.

See Also

orch.dbcon
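
Example

This hedged sketch passes the connection object returned by orch.dbcon to orch.dbinfo; the output is omitted here.

R> conn <- orch.dbcon()
R> orch.dbinfo(conn)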


orch.disconnect

Disconnects the local R session from Oracle Database.

Usage

orch.disconnect(
        silent,
        dbcon)

Arguments

silent

Set to TRUE to suppress all messages, or FALSE to see them (default).

dbcon

Set to TRUE to display the connection details, or FALSE to suppress this information (default).

Usage Notes

No orch functions work without a connection to Oracle Database.

Return Value

An Oracle Database connection object when dbcon is TRUE, otherwise NULL

See Also

orch.connect, orch.reconnect

Example

This example disconnects the local R session from Oracle Database:

R> orch.disconnect()
Disconnected from a database.

The next example disconnects the local R session from Oracle Database and displays the returned connection object:

R> oid<-orch.disconnect(silent=TRUE, dbcon=TRUE)
R> oid
Object of class "orch.dbcon"
data frame with 0 columns and 0 rows
Slot "ok":
[1] TRUE

Slot "host":
[1] "localhost"

Slot "port":
[1] 1521

Slot "sid":
[1] "orcl"

Slot "user":
[1] "RQUSER"

Slot "passwd":
[1] ""

Slot "secure":
[1] TRUE

Slot "drv":
[1] "sqoop"

orch.dryrun

Switches the execution platform between the local host and the Hadoop cluster. No changes in the R code are required for a dry run.

Usage

orch.dryrun(onoff)

Arguments

onoff

Set to TRUE to run a MapReduce program locally, or FALSE to run the program on the Hadoop cluster.

Usage Notes

The orch.dryrun function enables you to run a MapReduce program locally on a laptop using a small data set before running it on a Hadoop cluster using a very large data set. The mappers and reducers are run sequentially on row streams from HDFS. The Hadoop cluster is not required for a dry run.
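
A hedged sketch of that development cycle follows; the MapReduce logic is a placeholder that uses the built-in cars data set.

orch.dryrun(TRUE)                    # run locally on a small data set
res <- hadoop.run(cars,
    mapper  = function(k, v) { orch.keyval(k, v) },
    reducer = function(k, vals) { orch.keyvals(k, vals) })

orch.dryrun(FALSE)                   # switch back to the Hadoop cluster
res <- hadoop.run(cars,              # the same code now runs as a MapReduce job
    mapper  = function(k, v) { orch.keyval(k, v) },
    reducer = function(k, vals) { orch.keyvals(k, vals) })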

Return Value

The current setting of orch.dryrun

See Also

hadoop.exec, hadoop.run

Example

This example changes the value of orch.dryrun from FALSE to TRUE.

R> orch.dryrun()
[1] FALSE
R> orch.dryrun(onoff<-T)
R> orch.dryrun()
[1] TRUE

orch.export

Makes R objects from a user's local R session available in the Hadoop execution environment, so that they can be referenced in MapReduce jobs.

Usage

orch.export(...)

Arguments

...

One or more variables, data frames, or other in-memory objects, by name or as an explicit definition, in a comma-separated list.

Usage Notes

You can use this function to prepare local variables for use in hadoop.exec and hadoop.run functions. The mapper, reducer, combiner, init, and final arguments can reference the exported variables.

Return Value

A list object

See Also

hadoop.exec, hadoop.run

Example

This code fragment shows orch.export used in the export argument of the hadoop.run function:

hadoop.run(x,
    export = orch.export(a=1, b=2),
    mapper = function(k,v) {
        x <- a + b
        orch.keyval(key=NULL, val=x)
    }
)

The following is a similar code fragment, except that the variables are defined outside the hadoop.run function:

a=1
b=2
hadoop.run(x,
    export = orch.export(a, b),
    .
    .
    .
)

orch.keyval

Outputs key-value pairs in a MapReduce job.

Usage

orch.keyval(
        key,
        value)

Arguments

key

A scalar value.

value

A data structure such as a scalar, list, data frame, or vector.

Usage Notes

This function can only be used in the mapper, reducer, or combiner arguments of hadoop.exec and hadoop.run. Because the orch.keyval function is not exposed in the ORCH client API, you cannot call it anywhere else.

See Also

orch.pack

Return Value

(key, value) structures

Example

This code fragment creates a mapper function using orch.keyval:

hadoop.run(data,
     mapper = function(k,v) {
        orch.keyval(k,v)
     })

orch.keyvals

Outputs a set of key-value pairs in a MapReduce job.

Usage

orch.keyvals(
        key,
        value)

Arguments

key

A scalar value.

value

A data structure such as a scalar, list, data frame, or vector.

Usage Notes

This function can only be used in the mapper, reducer, or combiner arguments of hadoop.exec and hadoop.run. Because the orch.keyvals function is not exposed in the ORCH client API, you cannot call it anywhere else.

See Also

orch.keyval, orch.pack

Return Value

(key, value) structures

Example

This code fragment creates a mapper function using orch.keyval and a reducer function using orch.keyvals:

hadoop.run(data,
     mapper = function(k, v) {
          if (v$value > 10) {
               orch.keyval(k, v)
          }
          else {
               NULL
          }
     },
     reducer = function(k, vals) {
          orch.keyvals(k, vals)
     }
)

The following code fragment shows orch.keyval in a for loop to perform the same reduce operation as orch.keyvals in the previous example:

reducer = function(k, vals) {
     out <- list()
     for (v in vals) {
          out <- c(out, list(orch.keyval(k, v)))
     }
     out
}

orch.pack

Compresses one or more in-memory R objects that the mappers or reducers must write as the values in key-value pairs.

Usage

orch.pack(...)

Arguments

...

One or more variables, data frames, or other in-memory objects in a comma-separated list.

Usage Notes

This function requires the bitops package, which you can download from the Comprehensive R Archive Network (CRAN) at

http://cran.r-project.org/

You should use this function when passing nonscalar or complex R objects, such as data frames and R classes, between the mapper and reducer functions. You do not need to use it on scalar or other simple objects. You can use orch.pack to vary the data formats, data sets, and variable names for each output value.

You should also use orch.pack when storing the resultant data set in HDFS. The compressed data set is not corrupted by being stored in an HDFS file.

The orch.pack function must always be followed by the orch.unpack function to restore the data to a usable format.

Return Value

Compressed character-type data as a long string with no special characters

See Also

hadoop.exec, hadoop.run, orch.keyval, orch.unpack

Example

This code fragment compresses the content of several R objects into a serialized stream using orch.pack, and then creates key-value pairs using orch.keyval:

orch.keyval(NULL, orch.pack(
     r = r,
     qy = qy,
     yy = yy,
     nRows = nRows))

orch.reconnect

Reconnects to Oracle Database with the credentials previously returned by orch.disconnect.

Note:

The orch.reconnect function is deprecated in release 1.1 and will be desupported in release 2.0. Use orch.connect to reconnect using a connection object returned by orch.dbcon.

Usage

orch.reconnect(dbcon)

Arguments

dbcon

Credentials previously returned by orch.disconnect.

Usage Notes

Oracle R Connector for Hadoop preserves all user credentials and connection attributes, enabling you to reconnect to a previously disconnected session. Depending on the orch.connect secure setting for the original connection, you may be prompted for a password. After reconnecting, you can continue data transfer operations between Oracle Database and HDFS.

Reconnecting to a session is faster than opening a new one, because reconnecting does not require extensive connectivity checks.

Return Value

TRUE for a successfully reestablished and validated connection, or FALSE for a failed attempt

See Also

orch.connect

Example

R> orch.reconnect(oid)
Connecting ORCH to RDBMS via [sqoop]
    Host: localhost
    Port: 1521
    SID:  orcl
    User: RQUSER
Enter password for [RQUSER]: password
Connected
[1] TRUE

orch.unpack

Restores the R objects that were compressed with a previous call to orch.pack.

Usage

orch.unpack(...)

Arguments

...

The name of a compressed object.

Usage Notes

This function requires the bitops package, which you can download from a Comprehensive R Archive Network (CRAN) mirror site at

http://cran.r-project.org/

This function is typically used at the beginning of a mapper or reducer function to obtain and prepare the input. However, it can also be used externally, such as at the R console, to unpack the results of a MapReduce job.

Consider this data flow:

ORCH client to mapper to combiner to reducer to ORCH client.

If the data is packed at one stage, then it must be unpacked at the next stage.
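
For example, in the following hedged fragment the mapper packs a data frame into the value and the reducer unpacks it; the object names are placeholders.

hadoop.run(dfs,
    mapper = function(k, v) {
        df <- data.frame(key=k, val=v)
        orch.keyval(k, orch.pack(df=df))     # pack the nonscalar value
    },
    reducer = function(k, vals) {
        x <- orch.unpack(vals[[1]])          # unpack at the next stage
        orch.keyval(k, nrow(x$df))
    }
)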

Return Value

Uncompressed list-type data, which can contain any number and any type of variables

See Also

orch.pack

Example

This code fragment restores the data that was compressed in the orch.pack example:

reducer = function(key, vals) {
     x     <- orch.unpack(vals[[1]])
     r     <- x$r
     qy    <- x$qy
     yy    <- x$yy
     nRows <- x$nRows
     .
     .
     .
     }

orch.version

Identifies the version of the ORCH package.

Usage

orch.version()

Return Value

ORCH package version number

Example

This example shows that the ORCH version number is 0.1.8.

R> orch.version()
[1] "0.1.8"