hdfs.sample

Copies a random sample of data from a Hadoop file into an R in-memory object. Use this function to copy a small sample of the original HDFS data for developing the R calculation that you ultimately want to execute on the entire HDFS data set on the Hadoop cluster.

Usage

hdfs.sample(
        dfs.id,
        lines,
        sep)

Arguments

dfs.id

The name of a file in HDFS. The file name can include a path that is either absolute or relative to the current path.

lines

The number of lines to return as a sample. The default value is 1000 lines.

sep

The symbol used to separate fields in the Hadoop file. A comma (,) is the default separator.
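
As an illustration of the lines and sep arguments together, the following is a minimal sketch rather than a definitive usage. It assumes that Oracle R Connector for Hadoop is already loaded in the session; the file name ontime_tab is hypothetical and stands for a tab-delimited file in the current HDFS path.

```r
# Hypothetical tab-delimited HDFS file (not named in this reference).
tab_file <- "ontime_tab"
n_lines  <- 100

# Run only when hdfs.sample is available in this session.
if (exists("hdfs.sample")) {
  # Override both defaults: 100 lines instead of 1000, tab instead of comma.
  tab_sample <- hdfs.sample(tab_file, lines = n_lines, sep = "\t")
}
```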

Usage Notes

If the data originated in an R environment, then all metadata is extracted and all attributes are restored, including column names and data types. Otherwise, generic attribute names, like val1 and val2, are assigned.
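
When generic names are assigned, the sample can be renamed after it is loaded. The following is a minimal sketch using base R only; the data frame is a stand-in built by hand to resemble such a sample, and the column names YEAR and MONTH are assumed for illustration.

```r
# Stand-in for a sample whose source data did not originate in R:
# the columns carry generic names such as val1 and val2.
generic_sample <- data.frame(val1 = c(2000, 2000), val2 = c(12, 12))

# Assign meaningful column names (assumed here for illustration).
names(generic_sample) <- c("YEAR", "MONTH")
```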

This function can become slow when processing large HDFS input files, because of limitations inherited from the Hadoop command-line interface.

Return Value

A data.frame object with the sample data set, or NULL if the operation failed.
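
Because a failed call returns NULL instead of a data.frame, it may help to check the result before using it. The following is a minimal sketch, not part of the connector: check_sample is a hypothetical helper, and the guarded call assumes that Oracle R Connector for Hadoop is loaded and that a file named ontime_R exists in HDFS.

```r
# Hypothetical helper (not part of the connector): stop with a clear
# message if a sampling call returned NULL, otherwise return the result.
check_sample <- function(x, dfs.id) {
  if (is.null(x)) {
    stop("hdfs.sample failed for ", dfs.id, call. = FALSE)
  }
  x
}

# Run only when hdfs.sample is available in this session.
if (exists("hdfs.sample")) {
  ontime_sample <- check_sample(hdfs.sample("ontime_R", lines = 3), "ontime_R")
  str(ontime_sample)
}
```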

Example

This example displays a three-line sample of the ontime_R file.

R> hdfs.sample("ontime_R", lines=3)
  YEAR MONTH MONTH2 DAYOFMONTH DAYOFMONTH2 DAYOFWEEK DEPTIME...
1 2000    12     NA         31          NA         7     1730...
2 2000    12     NA         31          NA         7     1752...
3 2000    12     NA         31          NA         7     1803...