Text File Adapter

The text file adapter provides functions to read and write text files stored in HDFS. It is described in the following topics:

  • Built-in Functions for Reading and Writing Text Files

  • Custom Functions for Reading Text Files

  • Custom Functions for Writing Text Files

  • Examples of Text File Adapter Functions

Built-in Functions for Reading and Writing Text Files

To use the built-in functions in your query, you must import the text file module as follows:

import module "oxh:text";

The text file module contains the following functions:

  • text:collection

  • text:collection-xml

  • text:put

  • text:put-xml

  • text:trace

For examples, see "Examples of Text File Adapter Functions."

text:collection

Accesses a collection of text files in HDFS. The files might be split up and processed in parallel by multiple tasks.

Signature

declare %text:collection("text") function
   text:collection($uris as xs:string*) as xs:string* external;

Parameters

$uris

The text file URIs

Returns

One string value for each line in each file
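For example, a minimal sketch that copies every line of the matched files to the query output (the file pattern here is hypothetical):

import module "oxh:text";

(: read each line of every matching file and write it unchanged :)
for $line in text:collection("mydata/*.log")
return
   text:put($line)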

text:collection-xml

Accesses a collection of text files in HDFS. The files might be split up and processed in parallel by multiple tasks. Each line in each file is parsed as an XML document and returned by the function. Therefore, each line must fully contain a single XML document, and any new lines in the XML must be escaped with XML character references.

Signature

declare %text:collection("xml") function 
   text:collection-xml($uris as xs:string*) as document-node()* external;

Parameters

$uris

The text file URIs

Returns

One XML document for each line in each file
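For example, a minimal sketch that reads one XML document per line and writes a single value from each; the file name and element structure are assumptions for illustration:

import module "oxh:text";

(: each line of events.xml is assumed to hold one complete <event> element :)
for $doc in text:collection-xml("mydata/events.xml")
return
   text:put(fn:string($doc/event/@id))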

text:put

Writes a line to a text file in the output directory of the query. The lines are spread across one or more files.

Signature

declare %text:put("text") function
   text:put($value as xs:string) external;

Parameters

$value

The text to write

Returns

empty-sequence()

Notes

The number of files created depends on how the query is distributed among tasks. Each file has a name that starts with "part" (such as part-m-00000). You specify the output directory when the query executes. See "Running a Query."

text:put-xml

Writes XML to a line in a text file. The lines are spread across one or more files in the output directory of the query.

Newline characters in the serialized XML are replaced with character references to ensure that the XML does not span multiple lines. For example, &#xA; replaces the linefeed character (\n).

Signature

declare %text:put("xml") function
   text:put-xml($value as node()) external;

Parameters

$value

The XML to write

Returns

empty-sequence()

Notes

The number of files created depends on how the query is distributed among tasks. Each file has a name that starts with "part" (such as part-m-00000). You specify the output directory when the query executes. See "Running a Query."
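For example, a minimal sketch that converts each line of the visit logs shown in "Examples of Text File Adapter Functions" into a single-line XML document:

import module "oxh:text";

(: tokenize each comma-separated log line and emit it as one XML document per line :)
for $line in text:collection("mydata/visits*.log")
let $split := fn:tokenize($line, "\s*,\s*")
return
   text:put-xml(
      <visit user="{$split[2]}" page="{$split[3]}"/>
   )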

text:trace

Writes a line to a text file named trace-* in the output directory of the query. The lines are spread across one or more files.

This function provides you with a quick way to write to an alternate output. For example, you might create a trace file to identify invalid rows within a query, while loading the data into an Oracle database table.

Signature

declare %text:put("text") %text:file("trace") function
   text:trace($value as xs:string) external;

Parameters

$value

The text to write

Returns

empty-sequence()
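For example, a minimal sketch that writes well-formed rows to the main output and routes malformed rows to the trace-* files; the four-field row layout is an assumption for illustration:

import module "oxh:text";

(: rows with fewer than four fields are diverted to the trace output :)
for $line in text:collection("mydata/visits*.log")
let $split := fn:tokenize($line, "\s*,\s*")
return
   if (fn:count($split) ge 4) then
      text:put($line)
   else
      text:trace($line)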

Examples of Text File Adapter Functions

This example uses the following text files in HDFS. The files contain a log of visits to different web pages. Each line represents a visit to a web page and contains the time, user name, page visited, and an HTTP status code.

mydata/visits1.log  
 
2013-10-28T06:00:00, john, index.html, 200
2013-10-28T08:30:02, kelly, index.html, 200
2013-10-28T08:32:50, kelly, about.html, 200
2013-10-30T10:00:10, mike, index.html, 401

mydata/visits2.log  

2013-10-30T10:00:01, john, index.html, 200
2013-10-30T10:05:20, john, about.html, 200
2013-11-01T08:00:08, laura, index.html, 200
2013-11-04T06:12:51, kelly, index.html, 200
2013-11-04T06:12:40, kelly, contact.html, 200

The following query filters out the pages visited by john and writes only the date and page visited to a new text file:

import module "oxh:text";
 
for $line in text:collection("mydata/visits*.log")
let $split := fn:tokenize($line, "\s*,\s*")
where $split[2] eq "john"
return
   text:put($split[1] || " " || $split[3])
 

The query creates a text file that contains the following lines:

2013-10-28T06:00:00 index.html
2013-10-30T10:00:01 index.html
2013-10-30T10:05:20 about.html

The next query computes the number of page visits per day:

import module "oxh:text";
 
for $line in text:collection("mydata/visits*.log")
let $split := fn:tokenize($line, "\s*,\s*")
let $time := xs:dateTime($split[1])
let $day := xs:date($time)
group by $day
return
   text:put($day || " => " || count($line))
 

The query creates text files that contain the following lines:

2013-10-28 => 3
2013-10-30 => 3
2013-11-01 => 1
2013-11-04 => 2

Custom Functions for Reading Text Files

You can use the following annotations to define functions that read collections of text files in HDFS. These annotations provide additional functionality that is not available using the built-in functions.

Signature

Custom functions for reading text files must have one of the following signatures:

declare %text:collection("text") [additional annotations]
   function local:myFunctionName($uris as xs:string*) as xs:string* external;
 
declare %text:collection("xml") [additional annotations]
   function local:myFunctionName($uris as xs:string*) as document-node()* external;

Annotations

%text:collection(["method"])

Declares the text collection function. Required.

The optional method parameter can be one of the following values:

  • text: Each line in the text file is returned as xs:string. Default.

  • xml: Each line in the text file is parsed as XML and returned as document-node(). Each XML document must be fully contained on a single line. Newline characters inside the document must be represented by a numeric character reference.

%xmlf:split-max("split-size")

Specifies the maximum split size as either an integer or a string value. The split size controls how the input file is divided into tasks. Hadoop calculates the split size as max($split-min, min($split-max, $block-size)). Optional.

In a string value, you can append K, k, M, m, G, or g to the value to indicate kilobytes, megabytes, or gigabytes instead of bytes (the default unit).These qualifiers are not case sensitive. The following examples are equivalent:

%xmlf:split-max(1024)
%xmlf:split-max("1024")
%xmlf:split-max("1K")
%xmlf:split-min("split-size")

Specifies the minimum split size as either an integer or a string value. The split size controls how the input file is divided into tasks. Hadoop calculates the split size as max($split-min, min($split-max, $block-size)). Optional.

In a string value, you can append K, k, M, m, G, or g to the value to indicate kilobytes, megabytes, or gigabytes instead of bytes (the default unit). These qualifiers are not case sensitive. The following examples are equivalent:

%text:split-min(1024)
%text:split-min("1024")
%text:split-min("1K")

Parameters

$uris as xs:string*

Lists the HDFS file URIs. Required.

Return Type

xs:string* for the text method

document-node()* for the xml method


Custom Functions for Writing Text Files

You can use the following annotations to define functions that write text files in HDFS.

Signature

Custom functions for writing text files must have one of the following signatures:

declare %text:put("text") [additional annotations] function 
   text:myFunctionName($value as xs:string) external;
 
declare %text:put("xml") [additional annotations] function 
   text:myFunctionName($value as node()) external;

Annotations

%text:put(["method"])

Declares the text put function. Required.

The optional method parameter can be one of the following values:

  • text: Writes data to a text file. Default.

  • xml: Writes data to an XML file. The XML is serialized, and newline characters are replaced with character references. This ensures that the resulting XML document is one text line with no line breaks.

%text:compress("codec")

Specifies the compression format used on the output. The default is no compression. Optional.

The codec parameter identifies a compression codec. The first registered compression codec that matches the value is used. The value matches a codec if it equals one of the following:

  1. The fully qualified class name of the codec

  2. The unqualified class name of the codec

  3. The prefix of the unqualified class name before "Codec" (case insensitive)

All of these examples specify the default codec:

%text:compress("org.apache.hadoop.io.compress.DefaultCodec")
%text:compress("DefaultCodec")
%text:compress("default")

%text:file("name")

Specifies the output file name prefix. The default prefix is part.

%output:parameter

A standard XQuery serialization parameter for the output method (text or XML) specified in %text:put. See "Serialization Annotations."

UTF-8 is currently the only supported character encoding.
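For illustration only, a custom put function might set a standard XQuery serialization parameter such as omit-xml-declaration; whether a particular parameter applies depends on the output method:

(: a hypothetical sketch; the function name is illustrative and the
   parameter is assumed to be supported for the xml output method :)
declare
   %text:put("xml")
   %output:omit-xml-declaration("yes")
function local:write($value as node()) external;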

Examples of Text File Functions

Example 1   Querying Simple Delimited Formats

This example uses the fn:tokenize function to parse the lines of a text file. This technique works well for simple delimited formats.

The following query declares custom put and collection functions. It computes the number of hits and the number of unique users for each page in the logs.

import module "oxh:text";

declare
   %text:collection("text")
   %text:split-max("32m")
function local:col($uris as xs:string*) as xs:string* external;
 
declare
   %text:put("xml")
   %text:compress("gzip") 
   %text:file("pages")
function local:out($arg as node()) external;
 
for $line in local:col("mydata/visits*.log")
let $split := fn:tokenize($line, "\s*,\s*")
let $user := $split[2]
let $page := $split[3]
group by $page
return
   local:out(
      <page>
         <name>{$page}</name>
         <hits>{count($line)}</hits>
         <users>{fn:count(fn:distinct-values($user))}</users>
      </page>
   )

The output directory of the previous query is named myoutput. The following lines are written to myoutput/pages-r-*.gz.

<page><name>about.html</name><hits>2</hits><users>2</users></page>
<page><name>contact.html</name><hits>1</hits><users>1</users></page>
<page><name>index.html</name><hits>6</hits><users>4</users></page>

The files are compressed with the gzip codec. The following query reads the output files and writes the page name and total hits as plain text. The collection function automatically decompresses the files.

import module "oxh:text";

for $page in text:collection-xml("myoutput/page*.gz")/page
return 
   text:put($page/name || "," || $page/hits)
 

This query creates text files that contain the following lines:

about.html,2
contact.html,1
index.html,6

Example 2   Querying Complex Text Formats

The fn:tokenize function might not be adequate for complex formats that contain a variety of data types and delimiters. This example uses the fn:analyze-string function to process a log file in the Apache Common Log format.

A text file named mydata/access.log in HDFS contains the following lines:

192.0.2.0 - - [30/Sep/2013:16:39:38 +0000] "GET /inddex.html HTTP/1.1" 404 284
192.0.2.0 - - [30/Sep/2013:16:40:54 +0000] "GET /index.html HTTP/1.1" 200 12390
192.0.2.4 - - [01/Oct/2013:12:10:54 +0000] "GET /index.html HTTP/1.1" 200 12390
192.0.2.4 - - [01/Oct/2013:12:12:12 +0000] "GET /about.html HTTP/1.1" 200 4567
192.0.2.1 - - [02/Oct/2013:08:39:38 +0000] "GET /indexx.html HTTP/1.1" 404 284
192.0.2.1 - - [02/Oct/2013:08:40:54 +0000] "GET /index.html HTTP/1.1" 200 12390
192.0.2.1 - - [02/Oct/2013:08:42:38 +0000] "GET /aobut.html HTTP/1.1" 404 283
 

The following query returns the requests made after September 2013 for which the server returned a status code 404 (Not Found) error. It uses a regular expression and fn:analyze-string to match the components of the log entries. Unlike the timestamps in Example 1, this time format cannot be cast directly to xs:dateTime. Instead, the ora-fn:dateTime-from-string-with-format function converts the string to an instance of xs:dateTime.

import module "oxh:text";

declare variable $REGEX :=
   '(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]+)" (\S+) (\S+)';

for $line in text:collection("mydata/access.log")
let $match := fn:analyze-string($line, $REGEX)/fn:match
let $time := 
   ora-fn:dateTime-from-string-with-format(
      "dd/MMM/yyyy:HH:mm:ss Z",
      $match/fn:group[4]
   )
let $status := $match/fn:group[6]
where 
   $status eq "404" and 
   $time ge xs:dateTime("2013-10-01T00:00:00")
let $host := $match/fn:group[1]
let $request := $match/fn:group[5]
return
   text:put($host || "," || $request)

The query creates text files that contain the following lines:

192.0.2.1,GET /indexx.html HTTP/1.1
192.0.2.1,GET /aobut.html HTTP/1.1
