The text file adapter provides functions to read and write text files stored in HDFS. It is described in the following topics:
To use the built-in functions in your query, you must import the text file module as follows:
import module "oxh:text";
The text file module contains the following functions:
For examples, see "Examples of Text File Adapter Functions."
Accesses a collection of text files in HDFS. The files can be compressed using a Hadoop-supported compression codec. They are automatically decompressed when read.
The files might be split up and processed in parallel by multiple tasks.
declare %text:collection("text") function text:collection($uris as xs:string*) as xs:string* external; declare %text:collection("text") function function text:collection($uris as xs:string*, $delimiter as xs:string?) as xs:string* external;
$uris
: The text file URIs.
$delimiter
: A custom delimiter on which the file is split. The default is the newline character.
One string value for each file segment identified by the delimiter; for the default delimiter, a string value for each line in each file
Accesses a collection of text files in HDFS. The files can be compressed using a Hadoop-supported compression codec. They are automatically decompressed when read.
The files might be split up and processed in parallel by multiple tasks. Each delimited section of each file is parsed as an XML document and returned by the function. Therefore, each segment must fully contain a single XML document, and any delimit characters in the XML must be escaped with XML character references. By default, the delimiter is a new line.
declare %text:collection("xml") function text:collection-xml($uris as xs:string*) as document-node()* external; declare %text:collection("xml") function text:collection-xml($uris as xs:string*, $delimiter as xs:string?) as document-node()* external;
$uris
: The text file URIs.
$delimiter
: A custom delimiter on which the file is split. The default is the newline character.
One string value for each file segment identified by the delimiter; for the default delimiter, a string value for each line in each file
Writes a line to a text file in the output directory of the query. The lines are spread across one or more files.
declare %text:put("text") function text:put($value as xs:string) external;
$value
: The text to write
empty-sequence()
The number of files created depends on how the query is distributed among tasks. Each file has a name that starts with part
, such as part-m-00000
. You specify the output directory when the query executes. See "Running a Query."
Writes XML to a line in a text file. The lines are spread across one or more files in the output directory of the query.
Newline characters in the serialized XML are replaced with character references to ensure that the XML does not span multiple lines. For example, 

replaces the linefeed character (\n
).
declare %text:put("xml") function text:put-xml($value as node()) external;
$value
: The XML to write
empty-sequence()
The number of files created depends on how the query is distributed among tasks. Each file has a name that starts with part
, such as part-m-00000
. You specify the output directory when the query executes. See "Running a Query."
Writes a line to a text file named trace-*
in the output directory of the query. The lines are spread across one or more files.
This function provides you with a quick way to write to an alternate output. For example, you might create a trace file to identify invalid rows within a query, while loading the data into an Oracle database table.
declare %text:put("text") %text:file("trace") function text:trace($value as xs:string) external;
$value
: The text to write
empty-sequence()
You can use the following annotations to define functions that read collections of text files in HDFS. These annotations provide additional functionality that is not available using the built-in functions.
The input files can be compressed with a Hadoop-supported compression codec. They are automatically decompressed when read.
Custom functions for reading text files must have one of the following signatures:
declare %text:collection("text") [additional annotations] function local:myFunctionName($uris as xs:string*, $delimiter as xs:string?) as xs:string* external; declare %text:collection("text") [additional annotations] function local:myFunctionName($uris as xs:string*) as xs:string* external; declare %text:collection("xml") [additional annotations] function local:myFunctionName($uris as xs:string*, $delimiter as xs:string?) as document-node()* external declare %text:collection("xml") [additional annotations] function local:myFunctionName($uris as xs:string*) as document-node()* external;
Declares the text
collection function. Required.
The optional method parameter can be one of the following values:
text
: Each line in the text file is returned as xs:string
. Default.
xml
: Each line in the text file is parsed as XML and returned as document-node
. Each XML document must be fully contained on a single line. Newline characters inside the document must be represented by a numeric character reference.
Specifies a custom delimiter for splitting the input files. The default delimiter is the newline character.
Do not combine this annotation with the $delimiter
parameter. To specify a custom delimiter, use either this annotation or the $delimiter
parameter.
Specifies the maximum split size as either an integer or a string value. The split size controls how the input file is divided into tasks. Hadoop calculates the split size as max($split-min, min($split-max, $block-size))
. Optional.
In a string value, you can append K
, k
, M
, m
, G
, or g
to the value to indicate kilobytes, megabytes, or gigabytes instead of bytes (the default unit). These qualifiers are not case sensitive. The following examples are equivalent:
%text:split-max(1024) %text:split-max("1024") %text:split-max("1K")
Specifies the minimum split size as either an integer or a string value. The split size controls how the input file is divided into tasks. Hadoop calculates the split size as max($split-min, min($split-max, $block-size))
. Optional.
In a string value, you can append K
, k
, M
, m
, G
, or g
to the value to indicate kilobytes, megabytes, or gigabytes instead of bytes (the default unit). These qualifiers are not case sensitive. The following examples are equivalent:
%text:split-min(1024) %text:split-min("1024") %text:split-min("1K")
Lists the HDFS file URIs. The files can be uncompressed or compressed with a Hadoop-supported codec. Required.
A custom delimiter on which the input text files are split. The default delimiter is a new line. Do not combine this parameter with the %text:split
annotation.
You can use the following annotations to define functions that write text files in HDFS.
Custom functions for writing text files must have one of the following signatures:
declare %text:put("text") [additional annotations] function text:myFunctionName($value as xs:string) external; declare %text:put("xml") [additional annotations] function text:myFunctionName($value as node()) external;
Declares the text
put function. Required.
The optional method parameter can be one of the following values:
text
: Writes data to a text file. Default.
xml
: Writes data to an XML file. The XML is serialized and newline characters are replaced with character references. This process ensures that the resulting XML document is one text line with no line breaks.
Specifies the compression format used on the output. The default is no compression. Optional.
The codec parameter identifies a compression codec. The first registered compression codec that matches the value is used. The value matches a codec if it equals one of the following:
The fully qualified class name of the codec
The unqualified class name of the codec
The prefix of the unqualified class name before "Codec" (case insensitive)
All of these examples use the default codec and block compression:
%text:compress("org.apache.hadoop.io.compress.DefaultCodec", "block") %text:compress("DefaultCodec", "block") %text:compress("default", "block")
Specifies the output file name prefix. The default prefix is part
.
A standard XQuery serialization parameter for the output method (text or XML) specified in %text:put
. See "Serialization Annotations."
UTF-8 is currently the only supported character encoding.
This example uses following text files in HDFS. The files contain a log of visits to different web pages. Each line represents a visit to a web page and contains the time, user name, and page visited.
mydata/visits1.log 2013-10-28T06:00:00, john, index.html, 200 2013-10-28T08:30:02, kelly, index.html, 200 2013-10-28T08:32:50, kelly, about.html, 200 2013-10-30T10:00:10, mike, index.html, 401 mydata/visits2.log 2013-10-30T10:00:01, john, index.html, 200 2013-10-30T10:05:20, john, about.html, 200 2013-11-01T08:00:08, laura, index.html, 200 2013-11-04T06:12:51, kelly, index.html, 200 2013-11-04T06:12:40, kelly, contact.html, 200
The following query filters out the pages visited by john
and writes only the date and page visited to a new text file:
import module "oxh:text"; for $line in text:collection("mydata/visits*.log") let $split := fn:tokenize($line, "\s*,\s*") where $split[2] eq "john" return text:put($split[1] || " " || $split[3])
This query creates a text file that contains the following lines:
2013-10-28T06:00:00 index.html 2013-10-30T10:00:01 index.html 2013-10-30T10:05:20 about.html
The next query computes the number of page visits per day:
import module "oxh:text"; for $line in text:collection("mydata/visits*.log") let $split := fn:tokenize($line, "\s*,\s*") let $time := xs:dateTime($split[1]) let $day := xs:date($time) group by $day return text:put($day || " => " || count($line))
The query creates text files that contain the following lines:
2013-10-28 => 3 2013-10-30 => 3 2013-11-01 => 1 2013-11-04 => 2
This example uses the fn:tokenize
function to parse the lines of a text file. This technique works well for simple delimited formats.
The following query declares custom put and collection functions. It computes the number of hits and the number of unique users for each page in the logs.
import module "oxh:text"; declare %text:collection("text") %text:split-max("32m") function local:col($uris as xs:string*) as xs:string* external; declare %text:put("xml") %text:compress("gzip") %text:file("pages") function local:out($arg as node()) external; for $line in local:col("mydata/visits*.log") let $split := fn:tokenize($line, "\s*,\s*") let $user := $split[2] let $page := $split[3] group by $page return local:out( <page> <name>{$page}</name> <hits>{count($line)}</hits> <users>{fn:count(fn:distinct-values($user))}</users> </page> )
The output directory of the previous query is named myoutput
. The following lines are written to myoutput/pages-r-*.gz
.
<page><name>about.html</name><hits>2</hits><users>2</users></page> <page><name>contact.html</name><hits>1</hits><users>1</users></page> <page><name>index.html</name><hits>6</hits><users>4</users></page>
The files are compressed with the gzip
codec. The following query reads the output files, and writes the page name and total hits as plain text. The collection function automatically decodes the compressed files.
import module "oxh:text"; for $page in text:collection-xml("myoutput/page*.gz")/page return text:put($page/name || "," || $page/hits)
This query creates text files that contains the following lines:
about.html,2 contact.html,1 index.html,6
The fn:tokenize
function might not be adequate for complex formats that contain variety of data types and delimiters. This example uses the fn:analyze-string
function to process a log file in the Apache Common Log format.
A text file named mydata/access.log in HDFS contains the following lines:
192.0.2.0 - - [30/Sep/2013:16:39:38 +0000] "GET /inddex.html HTTP/1.1" 404 284 192.0.2.0 - - [30/Sep/2013:16:40:54 +0000] "GET /index.html HTTP/1.1" 200 12390 192.0.2.4 - - [01/Oct/2013:12:10:54 +0000] "GET /index.html HTTP/1.1" 200 12390 192.0.2.4 - - [01/Oct/2013:12:12:12 +0000] "GET /about.html HTTP/1.1" 200 4567 192.0.2.1 - - [02/Oct/2013:08:39:38 +0000] "GET /indexx.html HTTP/1.1" 404 284 192.0.2.1 - - [02/Oct/2013:08:40:54 +0000] "GET /index.html HTTP/1.1" 200 12390 192.0.2.1 - - [02/Oct/2013:08:42:38 +0000] "GET /aobut.html HTTP/1.1" 404 283
The following query computes the requests made after September 2013 when the server returned a status code 404 (Not Found) error. It uses a regular expression and fn:analyze-string
to match the components of the log entries. The time format cannot be cast directly to xs:dateTime
, as shown in Example 2. Instead, the ora-fn:dateTime-from-string-with-format
function converts the string to an instance of xs:dateTime
.
import module "oxh:text"; declare variable $REGEX := '(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]+)" (\S+) (\S+)'; for $line in text:collection("mydata/access.log") let $match := fn:analyze-string($line, $REGEX)/fn:match let $time := ora-fn:dateTime-from-string-with-format( "dd/MMM/yyyy:HH:mm:ss Z", $match/fn:group[4] ) let $status := $match/fn:group[6] where $status eq "404" and $time ge xs:dateTime("2013-10-01T00:00:00") let $host := $match/fn:group[1] let $request := $match/fn:group[5] return text:put($host || "," || $request)
The query creates text files that contain the following lines:
192.0.2.1,GET /indexx.html HTTP/1.1 192.0.2.1,GET /aobut.html HTTP/1.1
See Also:
XPath and XQuery Functions and Operators 3.0 specification for information about the fn:tokenize
and fn:analyze-string
functions:
http://www.w3.org/TR/xpath-functions-30/#func-tokenize
http://www.w3.org/TR/xpath-functions-30/#func-analyze-string
For information about the Apache Common log format: