The sequence file adapter provides functions to read and write Hadoop sequence files. A sequence file is a Hadoop-specific file format composed of key-value pairs.
The functions are described in the following topics:
See Also:
The Hadoop wiki for a description of Hadoop sequence files at http://wiki.apache.org/hadoop/SequenceFile

To use the built-in functions in your query, you must import the sequence file module as follows:
import module "oxh:seq";
The sequence file module contains the following functions:
For examples, see "Examples of Sequence File Adapter Functions."
Accesses a collection of sequence files in HDFS and returns the values as strings. The files may be split up and processed in parallel by multiple tasks.
declare %seq:collection("text") function seq:collection($uris as xs:string*) as xs:string* external;
The sequence file URIs. The values in the sequence files must be either org.apache.hadoop.io.Text or org.apache.hadoop.io.BytesWritable. For BytesWritable values, the bytes are converted to a string using a UTF-8 decoder.
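As a sketch, a query of the following form could copy the non-empty string values out of a set of sequence files into text files (the mydata path is hypothetical):

```xquery
import module "oxh:seq";
import module "oxh:text";

(: Read each sequence file value as a string and keep the non-empty ones :)
for $value in seq:collection("mydata/part*")
where string-length($value) gt 0
return text:put($value)
```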
Accesses a collection of sequence files in HDFS, parses each value as XML, and returns it. Each file may be split up and processed in parallel by multiple tasks.
declare %seq:collection("xml") function seq:collection-xml($uris as xs:string*) as document-node()* external;
The sequence file URIs. The values in the sequence files must be either org.apache.hadoop.io.Text or org.apache.hadoop.io.BytesWritable. For BytesWritable values, the XML document encoding declaration is used, if it is available.
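For instance, assuming sequence files whose values are comment documents like those used in the examples later in this section, a sketch:

```xquery
import module "oxh:seq";
import module "oxh:text";

(: Parse each value as XML and list the user attribute of each comment :)
for $doc in seq:collection-xml("mydata/part*")
return text:put($doc/comment/@user)
```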
Accesses a collection of sequence files in HDFS, reads each value as binary XML, and returns it. Each file may be split up and processed in parallel by multiple tasks.
declare %seq:collection("binxml") function seq:collection-binxml($uris as xs:string*) as document-node()* external;
The sequence file URIs. The values in the sequence files must be org.apache.hadoop.io.BytesWritable. The bytes are decoded as binary XML.

You can use this function to read files that were created by seq:put-binxml in a previous query. See "seq:put-binxml."
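A minimal sketch of such a round trip (the myoutput directory is hypothetical):

```xquery
import module "oxh:seq";
import module "oxh:text";

(: Decode binary XML values and re-serialize them as text :)
for $doc in seq:collection-binxml("myoutput/part*")
return text:put-xml($doc/*)
```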
Writes the string value of a key-value pair to a sequence file in the output directory of the query. The values are spread across one or more sequence files.
This function writes the values as org.apache.hadoop.io.Text, and sets the key class to org.apache.hadoop.io.NullWritable because there are no key values.
The number of files created depends on how the query is distributed among tasks. Each file has a name that starts with "part," such as part-m-00000. You specify the output directory when the query executes. See "Running a Query."
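A sketch of such a value-only write, assuming the single-argument signature seq:put($value as xs:string) and that text:collection from the text file module returns one string per line (the input path is hypothetical):

```xquery
import module "oxh:seq";
import module "oxh:text";

(: Copy each line of the text files into sequence files; no key is written :)
for $line in text:collection("mydata/*.txt")
return seq:put($line)
```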
Writes a key and string value to a sequence file in the output directory of the query. The values are spread across one or more sequence files.
This function writes the keys and values as org.apache.hadoop.io.Text.
declare %seq:put("text") function seq:put($key as xs:string, $value as xs:string) external;
Writes an XML value to a sequence file in the output directory of the query. The values are spread across one or more sequence files.
This function writes the values as org.apache.hadoop.io.Text, and sets the key class to org.apache.hadoop.io.NullWritable because there are no key values.
The number of files created depends on how the query is distributed among tasks. Each file has a name that starts with "part," such as part-m-00000. You specify the output directory when the query executes. See "Running a Query."
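A sketch, assuming the single-argument signature seq:put-xml($xml as node()) (the input path is hypothetical):

```xquery
import module "oxh:seq";
import module "oxh:xmlf";

(: Store only the comments that received at least one like :)
for $comment in xmlf:collection("mydata/comments*.xml", "comment")
where exists($comment/like)
return seq:put-xml($comment)
```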
Writes a key and XML value to a sequence file in the output directory of the query. The values are spread across one or more sequence files.
This function writes the keys and values as org.apache.hadoop.io.Text.
The number of files created depends on how the query is distributed among tasks. Each file has a name that starts with "part," such as part-m-00000. You specify the output directory when the query executes. See "Running a Query."
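A sketch, assuming a two-argument signature parallel to seq:put, i.e. seq:put-xml($key as xs:string, $xml as node()):

```xquery
import module "oxh:seq";
import module "oxh:xmlf";

(: Key each stored comment element by its id attribute :)
for $comment in xmlf:collection("mydata/comments*.xml", "comment")
return seq:put-xml($comment/@id, $comment)
```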
Encodes an XML value as binary XML and writes the resulting bytes to a sequence file in the output directory of the query. The values are spread across one or more sequence files.
This function writes the values as org.apache.hadoop.io.BytesWritable, and sets the key class to org.apache.hadoop.io.NullWritable because there are no key values.
The number of files created depends on how the query is distributed among tasks. Each file has a name that starts with "part," such as part-m-00000. You specify the output directory when the query executes. See "Running a Query."
You can use the seq:collection-binxml function to read the files created by this function. See "seq:collection-binxml."
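A sketch, assuming the value-only signature seq:put-binxml($xml as node()):

```xquery
import module "oxh:seq";
import module "oxh:xmlf";

(: Encode each comment as binary XML before writing :)
for $comment in xmlf:collection("mydata/comments*.xml", "comment")
return seq:put-binxml($comment)
```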
Encodes an XML value as binary XML and writes the resulting bytes to a sequence file in the output directory of the query. The values are spread across one or more sequence files.
This function writes the keys as org.apache.hadoop.io.Text and the values as org.apache.hadoop.io.BytesWritable.
declare %seq:put("binxml") function seq:put-binxml($key as xs:string, $xml as node()) external;
The number of files created depends on how the query is distributed among tasks. Each file has a name that starts with "part," such as part-m-00000. You specify the output directory when the query executes. See "Running a Query."
You can use the seq:collection-binxml function to read the files created by this function. See "seq:collection-binxml."
This example queries three XML files in HDFS with the following contents. Each XML file contains comments made by users on a specific day. Each comment can have zero or more "likes" from other users.
mydata/comments1.xml

<comments date="2013-12-30">
  <comment id="12345" user="john" text="It is raining :( "/>
  <comment id="56789" user="kelly" text="I won the lottery!">
    <like user="john"/>
    <like user="mike"/>
  </comment>
</comments>

mydata/comments2.xml

<comments date="2013-12-31">
  <comment id="54321" user="mike" text="Happy New Year!">
    <like user="laura"/>
  </comment>
</comments>

mydata/comments3.xml

<comments date="2014-01-01">
  <comment id="87654" user="mike" text="I don't feel so good."/>
  <comment id="23456" user="john" text="What a beautiful day!">
    <like user="kelly"/>
    <like user="phil"/>
  </comment>
</comments>
The following query stores the comment elements in sequence files.
import module "oxh:seq"; import module "oxh:xmlf"; for $comment in xmlf:collection("mydata/comments*.xml", "comment") return seq:put-xml($comment)
The next query reads the sequence files generated by the previous query, which are stored in an output directory named myoutput. The query then writes the names of users who made multiple comments to a text file.
import module "oxh:seq"; import module "oxh:text"; for $comment in seq:collection-xml("myoutput/part*")/comment let $user := $comment/@user group by $user let $count := count($comment) where $count gt 1 return text:put($user || " " || $count)
The text file created by the previous query contains the following lines:

john 2
mike 2
You can use the following annotations to define functions that read collections of sequence files. These annotations provide additional functionality that is not available using the built-in functions.
Custom functions for reading sequence files must have one of the following signatures:
declare %seq:collection("text") [additional annotations] function local:myFunctionName($uris as xs:string*) as xs:string* external; declare %seq:collection(["xml"|"binxml"]) [additional annotations] function local:myFunctionName($uris as xs:string*) as document-node()* external;
Declares the sequence file collection function, which reads sequence files. Required.
The optional method parameter can be one of the following values:
- text: The values in the sequence files must be either org.apache.hadoop.io.Text or org.apache.hadoop.io.BytesWritable (see Footnote 1). They are returned as xs:string. Default.

- xml: The values in the sequence files must be either org.apache.hadoop.io.Text or org.apache.hadoop.io.BytesWritable. The values are parsed as XML and returned by the function.

- binxml: The values in the sequence files must be org.apache.hadoop.io.BytesWritable. The values are read as XDK binary XML and returned by the function. See the Oracle XML Developer’s Kit Programmer’s Guide.
Specifies the character encoding of the input files. The valid encodings are those supported by the JDK. UTF-8 is the default encoding.
See Also:
"Supported Encodings" in the Oracle Java SE documentation athttp://docs.oracle.com/javase/7/docs/technotes/guides/intl/encoding.doc.html
Controls whether the key of a key-value pair is set as the document-uri of the returned value. Specify true to return the keys. The default setting is true when method is binxml or xml, and false when it is text.
Text functions with this annotation set to true must return text()* instead of xs:string*, because an atomic xs:string is not associated with a document.
When the keys are returned, you can obtain their string representations by using the seq:key function.
This example returns text instead of string values because %seq:key is set to true.
declare %seq:collection("text") %seq:key("true") function local:col($uris as xs:string*) as text()* external;
The next example uses the seq:key function to obtain the string representations of the keys:
for $value in local:col(...)
let $key := $value/seq:key()
return . . .
Specifies the maximum split size as either an integer or a string value. The split size controls how the input file is divided into tasks. Hadoop calculates the split size as max($split-min, min($split-max, $block-size)). Optional.
In a string value, you can append K, k, M, m, G, or g to the value to indicate kilobytes, megabytes, or gigabytes instead of bytes (the default unit). These qualifiers are not case sensitive. The following examples are equivalent:

%seq:split-max(1024)
%seq:split-max("1024")
%seq:split-max("1K")
Specifies the minimum split size as either an integer or a string value. The split size controls how the input file is divided into tasks. Hadoop calculates the split size as max($split-min, min($split-max, $block-size)). Optional.
In a string value, you can append K, k, M, m, G, or g to the value to indicate kilobytes, megabytes, or gigabytes instead of bytes (the default unit). These qualifiers are not case sensitive. The following examples are equivalent:

%seq:split-min(1024)
%seq:split-min("1024")
%seq:split-min("1K")
You can use the following annotations to define functions that write collections of sequence files in HDFS.
Custom functions for writing sequence files must have one of the following signatures. You can omit the $key argument when you are not writing a key value.
declare %seq:put("text") [additional annotations] function local:myFunctionName($key as xs:string, $value as xs:string) external; declare %seq:put(["xml"|"binxml"]) [additional annotations] function local:myFunctionName($key as xs:string, $xml as node()) external;
Declares the sequence file put function, which writes key-value pairs to a sequence file. Required.
If you use the $key argument in the signature, then the key is written as org.apache.hadoop.io.Text. If you omit the $key argument, then the key class is set to org.apache.hadoop.io.NullWritable.
Set the method parameter to text, xml, or binxml. The method determines the type used to write the value:
- text: String written as org.apache.hadoop.io.Text

- xml: XML written as org.apache.hadoop.io.Text

- binxml: XML encoded as XDK binary XML and written as org.apache.hadoop.io.BytesWritable
Specifies the compression format used on the output. The default is no compression. Optional.
The codec parameter identifies a compression codec. The first registered compression codec that matches the value is used. The value matches a codec if it equals one of the following:
- The fully qualified class name of the codec

- The unqualified class name of the codec

- The prefix of the unqualified class name before Codec (case insensitive)
Set the compressionType parameter to one of these values:
- block: Keys and values are collected in groups and compressed together. Block compression is generally more compact, because the compression algorithm can take advantage of similarities among different values.

- record: Only the values in the sequence file are compressed.
All of these examples use the default codec and block compression:
%seq:compress("org.apache.hadoop.io.compress.DefaultCodec", "block")
%seq:compress("DefaultCodec", "block")
%seq:compress("default", "block")
Specifies the output file name prefix. The default prefix is part.
A standard XQuery serialization parameter for the output method (text or XML) specified in %seq:put. See "Serialization Annotations."
See Also:
The Hadoop Wiki SequenceFile topic at http://wiki.apache.org/hadoop/SequenceFile
"The Influence of Serialization Parameters" sections for XML and text output methods in XSLT and XQuery Serialization 3.0 at
Examples of Sequence File Adapter Functions
The following query extracts comment elements from XML files and stores them in compressed sequence files. Before storing each comment, it deletes the id attribute and uses the value as the key in the sequence files.
import module "oxh:xmlf"; declare %seq:put("xml") %seq:compress("default", "block") %seq:file("comments") function local:myPut($key as xs:string, $value as node()) external;
for $comment in xmlf:collection("mydata/comments*.xml", "comment")
let $id := $comment/@id
let $newComment :=
  copy $c := $comment
  modify delete node $c/@id
  return $c
return local:myPut($id, $newComment)
The next query reads the sequence files that the previous query created in an output directory named myoutput. The query automatically decompresses the sequence files.
import module "oxh:text"; import module "oxh:seq"; for $comment in seq:collection-xml("myoutput/comments*")/comment let $id := $comment/seq:key() where $id eq "12345" return text:put-xml($comment)
The previous query creates a text file that contains the following line:
<comment id="12345" user="john" text="It is raining :( "/>
Footnote Legend
Footnote 1: Bytes are decoded using the character set specified by the %output:encoding annotation.