Text File Adapter

The text file adapter provides functions to read and write text files stored in HDFS. It is described in the following topics:

  • Built-in Functions for Reading and Writing Text Files

  • Custom Functions for Reading Text Files

  • Custom Functions for Writing Text Files

  • Examples of Text File Adapter Functions

Built-in Functions for Reading and Writing Text Files

To use the built-in functions in your query, you must import the text file module as follows:

import module "oxh:text";

The text file module contains the following functions:

  • text:collection

  • text:collection-xml

  • text:put

  • text:put-xml

  • text:trace

For examples, see "Examples of Text File Adapter Functions."

text:collection

Accesses a collection of text files in HDFS. The files can be compressed using a Hadoop-supported compression codec. They are automatically decompressed when read.

The files might be split up and processed in parallel by multiple tasks.

Signature

declare %text:collection("text") function
   text:collection($uris as xs:string*) as xs:string* external;

declare %text:collection("text") function
   text:collection($uris as xs:string*, $delimiter as xs:string?) as xs:string* external;

Parameters

$uris: The text file URIs.

$delimiter: A custom delimiter on which the file is split. The default is the newline character.

Returns

One string value for each file segment identified by the delimiter; for the default delimiter, a string value for each line in each file
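
For example, the following query sketch reads files whose records are separated by semicolons rather than newlines and writes each record to the output. The file name mydata/records.txt is hypothetical:

import module "oxh:text";

for $record in text:collection("mydata/records.txt", ";")
return
   text:put($record)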

text:collection-xml

Accesses a collection of text files in HDFS. The files can be compressed using a Hadoop-supported compression codec. They are automatically decompressed when read.

The files might be split up and processed in parallel by multiple tasks. Each delimited section of each file is parsed as an XML document and returned by the function. Therefore, each segment must fully contain a single XML document, and any delimiter characters in the XML must be escaped with XML character references. By default, the delimiter is the newline character.

Signature

declare %text:collection("xml") function 
   text:collection-xml($uris as xs:string*) as document-node()* external;

declare %text:collection("xml") function 
   text:collection-xml($uris as xs:string*, $delimiter as xs:string?) as document-node()* external;

Parameters

$uris: The text file URIs.

$delimiter: A custom delimiter on which the file is split. The default is the newline character.

Returns

One XML document for each file segment identified by the delimiter; for the default delimiter, an XML document for each line in each file
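
For example, if each line of a file contains one complete XML document, the documents can be queried directly. The following is a minimal sketch; the file mydata/events.xml and its <event> structure are hypothetical:

import module "oxh:text";

for $event in text:collection-xml("mydata/events.xml")/event
where $event/@status eq "error"
return
   text:put(fn:string($event/message))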

text:put

Writes a line to a text file in the output directory of the query. The lines are spread across one or more files.

Signature

declare %text:put("text") function
   text:put($value as xs:string) external;

Parameters

$value: The text to write

Returns

empty-sequence()

Notes

The number of files created depends on how the query is distributed among tasks. Each file has a name that starts with part, such as part-m-00000. You specify the output directory when the query executes. See "Running a Query."

text:put-xml

Writes XML to a line in a text file. The lines are spread across one or more files in the output directory of the query.

Newline characters in the serialized XML are replaced with character references to ensure that the XML does not span multiple lines. For example, &#xA; replaces the linefeed character (\n).

Signature

declare %text:put("xml") function
   text:put-xml($value as node()) external;

Parameters

$value: The XML to write

Returns

empty-sequence()

Notes

The number of files created depends on how the query is distributed among tasks. Each file has a name that starts with part, such as part-m-00000. You specify the output directory when the query executes. See "Running a Query."
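
For example, the following query sketch writes a constructed element that contains a linefeed; the <note> element is illustrative. The output is a single text line in which the linefeed appears as the character reference &#xA;:

import module "oxh:text";

text:put-xml(
   <note>line one&#xA;line two</note>
)

The query writes the line <note>line one&#xA;line two</note>.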

text:trace

Writes a line to a text file named trace-* in the output directory of the query. The lines are spread across one or more files.

This function provides you with a quick way to write to an alternate output. For example, you might create a trace file to identify invalid rows within a query, while loading the data into an Oracle database table.

Signature

declare %text:put("text") %text:file("trace") function
   text:trace($value as xs:string) external;

Parameters

$value: The text to write

Returns

empty-sequence()
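
Notes

As a sketch of this pattern, the following query routes malformed rows of the visit logs used in "Examples of Text File Adapter Functions" to the trace files, while valid rows go to the regular output:

import module "oxh:text";

for $line in text:collection("mydata/visits*.log")
let $split := fn:tokenize($line, "\s*,\s*")
return
   if (fn:count($split) lt 4) then
      text:trace($line)   (: malformed row, written to trace-* :)
   else
      text:put($line)     (: valid row, written to part-* :)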

Custom Functions for Reading Text Files

You can use the following annotations to define functions that read collections of text files in HDFS. These annotations provide additional functionality that is not available using the built-in functions.

The input files can be compressed with a Hadoop-supported compression codec. They are automatically decompressed when read.

Signature

Custom functions for reading text files must have one of the following signatures:

declare %text:collection("text") [additional annotations]
   function local:myFunctionName($uris as xs:string*, $delimiter as xs:string?) as xs:string* external;
 
declare %text:collection("text") [additional annotations]
   function local:myFunctionName($uris as xs:string*) as xs:string* external;
 
declare %text:collection("xml") [additional annotations]
   function local:myFunctionName($uris as xs:string*, $delimiter as xs:string?) as document-node()* external;

declare %text:collection("xml") [additional annotations]
   function local:myFunctionName($uris as xs:string*) as document-node()* external;

Annotations

%text:collection(["method"])

Declares the text collection function. Required.

The optional method parameter can be one of the following values:

  • text: Each line in the text file is returned as xs:string. Default.

  • xml: Each line in the text file is parsed as XML and returned as document-node(). Each XML document must be fully contained on a single line. Newline characters inside the document must be represented by a numeric character reference.

%text:split("delimiter")

Specifies a custom delimiter for splitting the input files. The default delimiter is the newline character.

Do not combine this annotation with the $delimiter parameter. To specify a custom delimiter, use either this annotation or the $delimiter parameter.
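
For example, the following declaration sketch reads files whose records are separated by semicolons; the function name local:myRecords is illustrative:

declare
   %text:collection("text")
   %text:split(";")
function local:myRecords($uris as xs:string*) as xs:string* external;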

%text:split-max("split-size")

Specifies the maximum split size as either an integer or a string value. The split size controls how the input file is divided into tasks. Hadoop calculates the split size as max($split-min, min($split-max, $block-size)). Optional.

In a string value, you can append K, k, M, m, G, or g to the value to indicate kilobytes, megabytes, or gigabytes instead of bytes (the default unit). These qualifiers are not case sensitive. The following examples are equivalent:

%text:split-max(1024)
%text:split-max("1024")
%text:split-max("1K")

%text:split-min("split-size")

Specifies the minimum split size as either an integer or a string value. The split size controls how the input file is divided into tasks. Hadoop calculates the split size as max($split-min, min($split-max, $block-size)). Optional.

In a string value, you can append K, k, M, m, G, or g to the value to indicate kilobytes, megabytes, or gigabytes instead of bytes (the default unit). These qualifiers are not case sensitive. The following examples are equivalent:

%text:split-min(1024)
%text:split-min("1024")
%text:split-min("1K")

Parameters

$uris as xs:string*

Lists the HDFS file URIs. The files can be uncompressed or compressed with a Hadoop-supported codec. Required.

$delimiter as xs:string?

A custom delimiter on which the input text files are split. The default delimiter is the newline character. Do not combine this parameter with the %text:split annotation.

Returns

xs:string* for the text method

document-node()* for the xml method

Custom Functions for Writing Text Files

You can use the following annotations to define functions that write text files in HDFS.

Signature

Custom functions for writing text files must have one of the following signatures:

declare %text:put("text") [additional annotations] function 
   text:myFunctionName($value as xs:string) external;
 
declare %text:put("xml") [additional annotations] function 
   text:myFunctionName($value as node()) external;

Annotations

%text:put(["method"])

Declares the text put function. Required.

The optional method parameter can be one of the following values:

  • text: Writes data to a text file. Default.

  • xml: Writes data to an XML file. The XML is serialized and newline characters are replaced with character references. This process ensures that the resulting XML document is one text line with no line breaks.

%text:compress("codec" [, "compressionType"])

Specifies the compression format used on the output. The default is no compression. Optional.

The codec parameter identifies a compression codec. The first registered compression codec that matches the value is used. The value matches a codec if it equals one of the following:

  1. The fully qualified class name of the codec

  2. The unqualified class name of the codec

  3. The prefix of the unqualified class name before "Codec" (case insensitive)

All of these examples use the default codec and block compression:

%text:compress("org.apache.hadoop.io.compress.DefaultCodec", "block")
%text:compress("DefaultCodec", "block")
%text:compress("default", "block") 

%text:file("name")

Specifies the output file name prefix. The default prefix is part.

%output:parameter

A standard XQuery serialization parameter for the output method (text or XML) specified in %text:put. See "Serialization Annotations."

UTF-8 is currently the only supported character encoding.

Examples of Text File Adapter Functions

Example 1   Using Built-in Functions to Query Text Files

This example uses the following text files in HDFS. The files contain a log of visits to different web pages. Each line represents a visit to a web page and contains the time, user name, page visited, and HTTP status code.

mydata/visits1.log  
 
2013-10-28T06:00:00, john, index.html, 200
2013-10-28T08:30:02, kelly, index.html, 200
2013-10-28T08:32:50, kelly, about.html, 200
2013-10-30T10:00:10, mike, index.html, 401

mydata/visits2.log  

2013-10-30T10:00:01, john, index.html, 200
2013-10-30T10:05:20, john, about.html, 200
2013-11-01T08:00:08, laura, index.html, 200
2013-11-04T06:12:51, kelly, index.html, 200
2013-11-04T06:12:40, kelly, contact.html, 200

The following query selects the pages visited by john and writes only the date and page visited to a new text file:

import module "oxh:text";
 
for $line in text:collection("mydata/visits*.log")
let $split := fn:tokenize($line, "\s*,\s*")
where $split[2] eq "john"
return
   text:put($split[1] || " " || $split[3])
 

This query creates a text file that contains the following lines:

2013-10-28T06:00:00 index.html
2013-10-30T10:00:01 index.html
2013-10-30T10:05:20 about.html

The next query computes the number of page visits per day:

import module "oxh:text";
 
for $line in text:collection("mydata/visits*.log")
let $split := fn:tokenize($line, "\s*,\s*")
let $time := xs:dateTime($split[1])
let $day := xs:date($time)
group by $day
return
   text:put($day || " => " || count($line))
 

The query creates text files that contain the following lines:

2013-10-28 => 3
2013-10-30 => 3
2013-11-01 => 1
2013-11-04 => 2

Example 2   Querying Simple Delimited Formats

This example uses the fn:tokenize function to parse the lines of a text file. This technique works well for simple delimited formats.

The following query declares custom put and collection functions. It computes the number of hits and the number of unique users for each page in the logs.

import module "oxh:text";

declare
   %text:collection("text")
   %text:split-max("32m")
function local:col($uris as xs:string*) as xs:string* external;
 
declare
   %text:put("xml")
   %text:compress("gzip") 
   %text:file("pages")
function local:out($arg as node()) external;
 
for $line in local:col("mydata/visits*.log")
let $split := fn:tokenize($line, "\s*,\s*")
let $user := $split[2]
let $page := $split[3]
group by $page
return
   local:out(
      <page>
         <name>{$page}</name>
         <hits>{count($line)}</hits>
         <users>{fn:count(fn:distinct-values($user))}</users>
      </page>
   )

Assume that the output directory of the previous query is named myoutput. The following lines are written to myoutput/pages-r-*.gz:

<page><name>about.html</name><hits>2</hits><users>2</users></page>
<page><name>contact.html</name><hits>1</hits><users>1</users></page>
<page><name>index.html</name><hits>6</hits><users>4</users></page>

The files are compressed with the gzip codec. The following query reads the output files and writes the page name and total hits as plain text. The collection function automatically decompresses the files.

import module "oxh:text";

for $page in text:collection-xml("myoutput/page*.gz")/page
return 
   text:put($page/name || "," || $page/hits)
 

This query creates text files that contain the following lines:

about.html,2
contact.html,1
index.html,6

Example 3   Querying Complex Text Formats

The fn:tokenize function might not be adequate for complex formats that contain a variety of data types and delimiters. This example uses the fn:analyze-string function to process a log file in the Apache Common Log format.

A text file named mydata/access.log in HDFS contains the following lines:

192.0.2.0 - - [30/Sep/2013:16:39:38 +0000] "GET /inddex.html HTTP/1.1" 404 284
192.0.2.0 - - [30/Sep/2013:16:40:54 +0000] "GET /index.html HTTP/1.1" 200 12390
192.0.2.4 - - [01/Oct/2013:12:10:54 +0000] "GET /index.html HTTP/1.1" 200 12390
192.0.2.4 - - [01/Oct/2013:12:12:12 +0000] "GET /about.html HTTP/1.1" 200 4567
192.0.2.1 - - [02/Oct/2013:08:39:38 +0000] "GET /indexx.html HTTP/1.1" 404 284
192.0.2.1 - - [02/Oct/2013:08:40:54 +0000] "GET /index.html HTTP/1.1" 200 12390
192.0.2.1 - - [02/Oct/2013:08:42:38 +0000] "GET /aobut.html HTTP/1.1" 404 283
 

The following query identifies the requests made after September 2013 in which the server returned a 404 (Not Found) error. It uses a regular expression and fn:analyze-string to match the components of the log entries. Unlike Example 2, the time format cannot be cast directly to xs:dateTime. Instead, the ora-fn:dateTime-from-string-with-format function converts the string to an instance of xs:dateTime.

import module "oxh:text";

declare variable $REGEX :=
   '(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]+)" (\S+) (\S+)';

for $line in text:collection("mydata/access.log")
let $match := fn:analyze-string($line, $REGEX)/fn:match
let $time := 
   ora-fn:dateTime-from-string-with-format(
      "dd/MMM/yyyy:HH:mm:ss Z",
      $match/fn:group[4]
   )
let $status := $match/fn:group[6]
where 
   $status eq "404" and 
   $time ge xs:dateTime("2013-10-01T00:00:00")
let $host := $match/fn:group[1]
let $request := $match/fn:group[5]
return
   text:put($host || "," || $request)

The query creates text files that contain the following lines:

192.0.2.1,GET /indexx.html HTTP/1.1
192.0.2.1,GET /aobut.html HTTP/1.1
