The text file adapter provides functions to read and write text files stored in HDFS. It is described in the following topics:
To use the built-in functions in your query, you must import the text file module as follows:
import module "oxh:text";
The text file module contains the following functions:
For examples, see "Examples of Text File Adapter Functions."
Accesses a collection of text files in HDFS. The files might be split up and processed in parallel by multiple tasks.
declare %text:collection("text") function text:collection($uris as xs:string*) as xs:string* external;
Accesses a collection of text files in HDFS. The files might be split up and processed in parallel by multiple tasks. Each line in each file is parsed as an XML document and returned by the function. Therefore, each line must fully contain a single XML document, and any newlines in the XML must be escaped with XML character references.
declare %text:collection("xml") function text:collection-xml($uris as xs:string*) as document-node()* external;
Writes a line to a text file in the output directory of the query. The lines are spread across one or more files.
The number of files created depends on how the query is distributed among tasks. Each file has a name that starts with "part" (such as part-m-00000). You specify the output directory when the query executes. See "Running a Query."
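The built-in signature, following the pattern of the other functions in this module, is:

declare %text:put("text") function text:put($value as xs:string) external;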
Writes XML to a line in a text file. The lines are spread across one or more files in the output directory of the query.
Newline characters in the serialized XML are replaced with character references to ensure that the XML does not span multiple lines. For example, &#xA; replaces the linefeed character (\n).
The number of files created depends on how the query is distributed among tasks. Each file has a name that starts with "part" (such as part-m-00000). You specify the output directory when the query executes. See "Running a Query."
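The built-in signature, again following the pattern of the other functions in this module (the parameter name is illustrative), is:

declare %text:put("xml") function text:put-xml($value as node()) external;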
Writes a line to a text file named trace-* in the output directory of the query. The lines are spread across one or more files.
This function provides you with a quick way to write to an alternate output. For example, you might create a trace file to identify invalid rows within a query, while loading the data into an Oracle database table.
declare %text:put("text") %text:file("trace") function text:trace($value as xs:string) external;
This example uses the following text files in HDFS. The files contain a log of visits to different web pages. Each line represents a visit to a web page and contains the time, user name, page visited, and HTTP status code.
mydata/visits1.log

2013-10-28T06:00:00, john, index.html, 200
2013-10-28T08:30:02, kelly, index.html, 200
2013-10-28T08:32:50, kelly, about.html, 200
2013-10-30T10:00:10, mike, index.html, 401

mydata/visits2.log

2013-10-30T10:00:01, john, index.html, 200
2013-10-30T10:05:20, john, about.html, 200
2013-11-01T08:00:08, laura, index.html, 200
2013-11-04T06:12:51, kelly, index.html, 200
2013-11-04T06:12:40, kelly, contact.html, 200
The following query selects only the pages visited by john and writes the date and page visited to a new text file:
import module "oxh:text"; for $line in text:collection("mydata/visits*.log") let $split := fn:tokenize($line, "\s*,\s*") where $split[2] eq "john" return text:put($split[1] || " " || $split[3])
The query creates a text file that contains the following lines:
2013-10-28T06:00:00 index.html
2013-10-30T10:00:01 index.html
2013-10-30T10:05:20 about.html
The next query computes the number of page visits per day:
import module "oxh:text"; for $line in text:collection("mydata/visits*.log") let $split := fn:tokenize($line, "\s*,\s*") let $time := xs:dateTime($split[1]) let $day := xs:date($time) group by $day return text:put($day || " => " || count($line))
The query creates text files that contain the following lines:
2013-10-28 => 3
2013-10-30 => 3
2013-11-01 => 1
2013-11-04 => 2
You can use the following annotations to define functions that read collections of text files in HDFS. These annotations provide additional functionality that is not available using the built-in functions.
Custom functions for reading text files must have one of the following signatures:
declare %text:collection("text") [additional annotations] function local:myFunctionName($uris as xs:string*) as xs:string* external; declare %text:collection("xml") [additional annotations] function local:myFunctionName($uris as xs:string*) as document-node()* external;
%text:collection(["method"])
Declares the text collection function. Required.
The optional method parameter can be one of the following values:

text: Each line in the text file is returned as xs:string. Default.

xml: Each line in the text file is parsed as XML and returned as document-node(). Each XML document must be fully contained on a single line. Newline characters inside the document must be represented by a numeric character reference.
%text:split-max
Specifies the maximum split size as either an integer or a string value. The split size controls how the input file is divided into tasks. Hadoop calculates the split size as max($split-min, min($split-max, $block-size)). Optional.

In a string value, you can append K, k, M, m, G, or g to the value to indicate kilobytes, megabytes, or gigabytes instead of bytes (the default unit). These qualifiers are not case sensitive. The following examples are equivalent:

%text:split-max(1024)
%text:split-max("1024")
%text:split-max("1K")
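For example, assuming a 128 MB HDFS block size and a default minimum split size smaller than 32 MB, %text:split-max("32M") yields max($split-min, min(32 MB, 128 MB)) = 32 MB, so each 128 MB block is processed as four splits.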
%text:split-min
Specifies the minimum split size as either an integer or a string value. The split size controls how the input file is divided into tasks. Hadoop calculates the split size as max($split-min, min($split-max, $block-size)). Optional.

In a string value, you can append K, k, M, m, G, or g to the value to indicate kilobytes, megabytes, or gigabytes instead of bytes (the default unit). These qualifiers are not case sensitive. The following examples are equivalent:

%text:split-min(1024)
%text:split-min("1024")
%text:split-min("1K")
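The two split annotations can be combined on one custom collection function, as in this sketch (the function name and sizes are illustrative):

declare
   %text:collection("text")
   %text:split-min("512K")
   %text:split-max("64M")
function local:lines($uris as xs:string*) as xs:string* external;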
You can use the following annotations to define functions that write text files in HDFS.
Custom functions for writing text files must have one of the following signatures:
declare %text:put("text") [additional annotations] function text:myFunctionName($value as xs:string) external; declare %text:put("xml") [additional annotations] function text:myFunctionName($value as node()) external;
%text:put(["method"])
Declares the text put function. Required.
The optional method parameter can be one of the following values:

text: Writes data to a text file. Default.

xml: Writes data to an XML file. The XML is serialized, and newline characters are replaced with character references. This ensures that the resulting XML document is a single line of text with no line breaks.
%text:compress
Specifies the compression format used on the output. The default is no compression. Optional.
The codec parameter identifies a compression codec. The first registered compression codec that matches the value is used. The value matches a codec if it equals one of the following:
The fully qualified class name of the codec
The unqualified class name of the codec
The prefix of the unqualified class name before "Codec" (case insensitive)
All of these examples use the default codec and block compression:
%text:compress("org.apache.hadoop.io.compress.DefaultCodec", "block") %text:compress("DefaultCodec", "block") %text:compress("default", "block")
%text:file
Specifies the output file name prefix. The default prefix is part.
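As a sketch (the function and file names are illustrative), the following declaration writes its lines to files named errors-* instead of part-*:

declare
   %text:put("text")
   %text:file("errors")
function local:err($value as xs:string) external;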
A standard XQuery serialization parameter for the output method (text or XML) specified in %text:put. See "Serialization Annotations."
UTF-8 is currently the only supported character encoding.
This example uses the fn:tokenize function to parse the lines of a text file. This technique works well for simple delimited formats.
The following query declares custom put and collection functions. It computes the number of hits and the number of unique users for each page in the logs.
import module "oxh:text"; declare %text:collection("text") %text:split-max("32m") function local:col($uris as xs:string*) as xs:string* external; declare %text:put("xml") %text:compress("gzip") %text:file("pages") function local:out($arg as node()) external; for $line in local:col("mydata/visits*.log") let $split := fn:tokenize($line, "\s*,\s*") let $user := $split[2] let $page := $split[3] group by $page return local:out( <page> <name>{$page}</name> <hits>{count($line)}</hits> <users>{fn:count(fn:distinct-values($user))}</users> </page> )
The output directory of the previous query is named myoutput. The following lines are written to myoutput/pages-r-*.gz.
<page><name>about.html</name><hits>2</hits><users>2</users></page>
<page><name>contact.html</name><hits>1</hits><users>1</users></page>
<page><name>index.html</name><hits>6</hits><users>4</users></page>
The files are compressed with the gzip codec. The following query reads the output files and writes the page name and total hits as plain text. The collection function automatically decompresses the files.
import module "oxh:text"; for $page in text:collection-xml("myoutput/page*.gz")/page return text:put($page/name || "," || $page/hits)
This query creates text files that contain the following lines:
about.html,2
contact.html,1
index.html,6
The fn:tokenize function might not be adequate for complex formats that contain a variety of data types and delimiters. This example uses the fn:analyze-string function to process a log file in the Apache Common Log format.
A text file named mydata/access.log in HDFS contains the following lines:
192.0.2.0 - - [30/Sep/2013:16:39:38 +0000] "GET /inddex.html HTTP/1.1" 404 284
192.0.2.0 - - [30/Sep/2013:16:40:54 +0000] "GET /index.html HTTP/1.1" 200 12390
192.0.2.4 - - [01/Oct/2013:12:10:54 +0000] "GET /index.html HTTP/1.1" 200 12390
192.0.2.4 - - [01/Oct/2013:12:12:12 +0000] "GET /about.html HTTP/1.1" 200 4567
192.0.2.1 - - [02/Oct/2013:08:39:38 +0000] "GET /indexx.html HTTP/1.1" 404 284
192.0.2.1 - - [02/Oct/2013:08:40:54 +0000] "GET /index.html HTTP/1.1" 200 12390
192.0.2.1 - - [02/Oct/2013:08:42:38 +0000] "GET /aobut.html HTTP/1.1" 404 283
The following query identifies the requests made after September 2013 for which the server returned a 404 (Not Found) status code. It uses a regular expression and fn:analyze-string to match the components of the log entries. The time format cannot be cast directly to xs:dateTime, as was done in Example 1. Instead, the ora-fn:dateTime-from-string-with-format function converts the string to an instance of xs:dateTime.
import module "oxh:text"; declare variable $REGEX := '(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]+)" (\S+) (\S+)'; for $line in text:collection("mydata/access.log") let $match := fn:analyze-string($line, $REGEX)/fn:match let $time := ora-fn:dateTime-from-string-with-format( "dd/MMM/yyyy:HH:mm:ss Z", $match/fn:group[4] ) let $status := $match/fn:group[6] where $status eq "404" and $time ge xs:dateTime("2013-10-01T00:00:00") let $host := $match/fn:group[1] let $request := $match/fn:group[5] return text:put($host || "," || $request)
The query creates text files that contain the following lines:
192.0.2.1,GET /indexx.html HTTP/1.1
192.0.2.1,GET /aobut.html HTTP/1.1
See Also:
The W3C XPath and XQuery Functions and Operators 3.0 specification for information about the fn:tokenize and fn:analyze-string functions:
http://www.w3.org/TR/xpath-functions-30/#func-tokenize
http://www.w3.org/TR/xpath-functions-30/#func-analyze-string
For information about the Apache Common Log format: