The sequence file adapter provides functions to read and write Hadoop sequence files. A sequence file is a Hadoop-specific file format composed of key-value pairs.
The functions are described in the following topics:
See Also:
The Hadoop wiki for a description of Hadoop sequence files at http://wiki.apache.org/hadoop/SequenceFile

To use the built-in functions in your query, you must import the sequence file module as follows:
import module "oxh:seq";
The sequence file module contains the following functions:
For examples, see "Examples of Sequence File Adapter Functions."
Accesses a collection of sequence files in HDFS and returns the values as strings. The files may be split up and processed in parallel by multiple tasks.
declare %seq:collection("text") function seq:collection($uris as xs:string*) as xs:string* external;
$uris: The sequence file URIs. The values in the sequence files must be either org.apache.hadoop.io.Text or org.apache.hadoop.io.BytesWritable. For BytesWritable values, the bytes are converted to a string using a UTF-8 decoder.

Returns: One string for each value in each file.
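As a minimal sketch, a query like the following reads every value from a set of sequence files and copies each one to a text file. The input path mydata/seqfiles is hypothetical:

import module "oxh:seq";
import module "oxh:text";

(: Read each value as a string and write it out unchanged :)
for $value in seq:collection("mydata/seqfiles/part*")
return text:put($value)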
Accesses a collection of sequence files in HDFS, parses each value as XML, and returns it. Each file may be split up and processed in parallel by multiple tasks.
declare %seq:collection("xml") function seq:collection-xml($uris as xs:string*) as document-node()* external;
$uris: The sequence file URIs. The values in the sequence files must be either org.apache.hadoop.io.Text or org.apache.hadoop.io.BytesWritable. For BytesWritable values, the XML document encoding declaration is used, if it is available.

Returns: One XML document for each value in each file.
Accesses a collection of sequence files in HDFS, reads each value as binary XML, and returns it. Each file may be split up and processed in parallel by multiple tasks.
declare %seq:collection("binxml") function seq:collection-binxml($uris as xs:string*) as document-node()* external;
$uris: The sequence file URIs. The values in the sequence files must be org.apache.hadoop.io.BytesWritable. The bytes are decoded as binary XML.

Returns: One XML document for each value in each file.
You can use this function to read files that were created by seq:put-binxml in a previous query. See "seq:put-binxml."
Writes either the string value or both the key and string value of a key-value pair to a sequence file in the output directory of the query.
This function writes the keys and values as org.apache.hadoop.io.Text.

When the function is called without the $key parameter, it writes the values as org.apache.hadoop.io.Text and sets the key class to org.apache.hadoop.io.NullWritable, because there are no key values.
declare %seq:put("text") function seq:put($key as xs:string, $value as xs:string) external;

declare %seq:put("text") function seq:put($value as xs:string) external;
$key: The key of a key-value pair

$value: The value of a key-value pair

Returns: empty-sequence()
The values are spread across one or more sequence files. The number of files created depends on how the query is distributed among tasks. Each file has a name that starts with part, such as part-m-00000. You specify the output directory when the query executes. See "Running Queries."
Writes either an XML value or a key and XML value to a sequence file in the output directory of the query.
This function writes the keys and values as org.apache.hadoop.io.Text.

When the function is called without the $key parameter, it writes the values as org.apache.hadoop.io.Text and sets the key class to org.apache.hadoop.io.NullWritable, because there are no key values.
declare %seq:put("xml") function seq:put-xml($key as xs:string, $xml as node()) external;

declare %seq:put("xml") function seq:put-xml($xml as node()) external;
$key: The key of a key-value pair

$xml: The value of a key-value pair

Returns: empty-sequence()
The values are spread across one or more sequence files. The number of files created depends on how the query is distributed among tasks. Each file has a name that starts with "part," such as part-m-00000. You specify the output directory when the query executes. See "Running Queries."
Encodes an XML value as binary XML and writes the resulting bytes to a sequence file in the output directory of the query. The values are spread across one or more sequence files.
This function writes the keys as org.apache.hadoop.io.Text and the values as org.apache.hadoop.io.BytesWritable.

When the function is called without the $key parameter, it writes the values as org.apache.hadoop.io.BytesWritable and sets the key class to org.apache.hadoop.io.NullWritable, because there are no key values.
declare %seq:put("binxml") function seq:put-binxml($key as xs:string, $xml as node()) external;

declare %seq:put("binxml") function seq:put-binxml($xml as node()) external;
$key: The key of a key-value pair

$xml: The value of a key-value pair

Returns: empty-sequence()
The number of files created depends on how the query is distributed among tasks. Each file has a name that starts with part, such as part-m-00000. You specify the output directory when the query executes. See "Running Queries."
You can use the seq:collection-binxml function to read the files created by this function. See "seq:collection-binxml."
You can use the following annotations to define functions that read collections of sequence files. These annotations provide additional functionality that is not available using the built-in functions.
Custom functions for reading sequence files must have one of the following signatures:
declare %seq:collection("text") [additional annotations] function local:myFunctionName($uris as xs:string*) as xs:string* external;

declare %seq:collection(["xml"|"binxml"]) [additional annotations] function local:myFunctionName($uris as xs:string*) as document-node()* external;
Declares the sequence file collection function, which reads sequence files. Required.
The optional method parameter can be one of the following values:

text: The values in the sequence files must be either org.apache.hadoop.io.Text or org.apache.hadoop.io.BytesWritable. Bytes are decoded using the character set specified by the %output:encoding annotation and returned as xs:string. Default.

xml: The values in the sequence files must be either org.apache.hadoop.io.Text or org.apache.hadoop.io.BytesWritable. The values are parsed as XML and returned by the function.

binxml: The values in the sequence files must be org.apache.hadoop.io.BytesWritable. The values are read as XDK binary XML and returned by the function. See Oracle XML Developer's Kit Programmer's Guide.
Specifies the character encoding of the input files. The valid encodings are those supported by the JVM. UTF-8 is the default encoding.
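For example, a custom function that reads Latin-1 encoded values might be declared as follows. This is a sketch; the function name is arbitrary:

(: Decode the value bytes as ISO-8859-1 instead of the UTF-8 default :)
declare
   %seq:collection("text")
   %output:encoding("ISO-8859-1")
function local:latin1($uris as xs:string*) as xs:string* external;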
See Also:
"Supported Encodings" in the Oracle Java SE documentation at http://docs.oracle.com/javase/7/docs/technotes/guides/intl/encoding.doc.html
Controls whether the key of a key-value pair is set as the document-uri of the returned value. Specify true to return the keys. The default setting is true when method is binxml or xml, and false when it is text.
Text functions with this annotation set to true must return text()* instead of xs:string*, because an atomic xs:string is not associated with a document.

When the keys are returned, you can obtain their string representations by using the seq:key function.
This example returns text instead of string values because %seq:key is set to true.
declare %seq:collection("text") %seq:key("true") function local:col($uris as xs:string*) as text()* external;
The next example uses the seq:key function to obtain the string representations of the keys:
for $value in local:col(...)
let $key := $value/seq:key()
return . . .
Specifies the maximum split size as either an integer or a string value. The split size controls how the input file is divided into tasks. Hadoop calculates the split size as max($split-min, min($split-max, $block-size)). Optional.

In a string value, you can append K, k, M, m, G, or g to the value to indicate kilobytes, megabytes, or gigabytes instead of bytes (the default unit). These qualifiers are not case sensitive. The following examples are equivalent:

%seq:split-max(1024)
%seq:split-max("1024")
%seq:split-max("1K")
Specifies the minimum split size as either an integer or a string value. The split size controls how the input file is divided into tasks. Hadoop calculates the split size as max($split-min, min($split-max, $block-size)). Optional.

In a string value, you can append K, k, M, m, G, or g to the value to indicate kilobytes, megabytes, or gigabytes instead of bytes (the default unit). These qualifiers are not case sensitive. The following examples are equivalent:

%seq:split-min(1024)
%seq:split-min("1024")
%seq:split-min("1K")
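The split annotations can be combined in a single custom collection declaration. As a sketch (the function name and sizes here are arbitrary):

(: Constrain split sizes so each task processes between 128 MB and 512 MB :)
declare
   %seq:collection("xml")
   %seq:split-min("128M")
   %seq:split-max("512M")
function local:mySplitCol($uris as xs:string*) as document-node()* external;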
You can use the following annotations to define functions that write collections of sequence files in HDFS.
Custom functions for writing sequence files must have one of the following signatures. You can omit the $key argument when you are not writing a key value.
declare %seq:put("text") [additional annotations] function local:myFunctionName($key as xs:string, $value as xs:string) external;

declare %seq:put(["xml"|"binxml"]) [additional annotations] function local:myFunctionName($key as xs:string, $xml as node()) external;
Declares the sequence file put function, which writes key-value pairs to a sequence file. Required.
If you use the $key argument in the signature, then the key is written as org.apache.hadoop.io.Text. If you omit the $key argument, then the key class is set to org.apache.hadoop.io.NullWritable.
Set the method parameter to text, xml, or binxml. The method determines the type used to write the value:

text: String written as org.apache.hadoop.io.Text

xml: XML written as org.apache.hadoop.io.Text

binxml: XML encoded as XDK binary XML and written as org.apache.hadoop.io.BytesWritable
Specifies the compression format used on the output. The default is no compression. Optional.
The codec parameter identifies a compression codec. The first registered compression codec that matches the value is used. The value matches a codec if it equals one of the following:
The fully qualified class name of the codec

The unqualified class name of the codec

The prefix of the unqualified class name before Codec (case insensitive)
Set the compressionType parameter to one of these values:
block: Keys and values are collected in groups and compressed together. Block compression is generally more compact, because the compression algorithm can take advantage of similarities among different values.

record: Only the values in the sequence file are compressed.
All of these examples use the default codec and block compression:
%seq:compress("org.apache.hadoop.io.compress.DefaultCodec", "block")
%seq:compress("DefaultCodec", "block")
%seq:compress("default", "block")
Specifies the output file name prefix. The default prefix is part.
A standard XQuery serialization parameter for the output method (text or XML) specified in %seq:put. See "Serialization Annotations."
See Also:
The Hadoop Wiki SequenceFile topic at http://wiki.apache.org/hadoop/SequenceFile

"The Influence of Serialization Parameters" sections for XML and text output methods in XSLT and XQuery Serialization 3.0
These examples query three XML files in HDFS with the following contents. Each XML file contains comments made by users on a specific day. Each comment can have zero or more "likes" from other users.
mydata/comments1.xml

<comments date="2013-12-30">
   <comment id="12345" user="john" text="It is raining :( "/>
   <comment id="56789" user="kelly" text="I won the lottery!">
      <like user="john"/>
      <like user="mike"/>
   </comment>
</comments>

mydata/comments2.xml

<comments date="2013-12-31">
   <comment id="54321" user="mike" text="Happy New Year!">
      <like user="laura"/>
   </comment>
</comments>

mydata/comments3.xml

<comments date="2014-01-01">
   <comment id="87654" user="mike" text="I don't feel so good."/>
   <comment id="23456" user="john" text="What a beautiful day!">
      <like user="kelly"/>
      <like user="phil"/>
   </comment>
</comments>
The following query stores the comment elements in sequence files.
import module "oxh:seq";
import module "oxh:xmlf";

for $comment in xmlf:collection("mydata/comments*.xml", "comment")
return seq:put-xml($comment)
The next query reads the sequence files generated by the previous query, which are stored in an output directory named myoutput. The query then writes the names of users who made multiple comments to a text file.
import module "oxh:seq";
import module "oxh:text";

for $comment in seq:collection-xml("myoutput/part*")/comment
let $user := $comment/@user
group by $user
let $count := count($comment)
where $count gt 1
return text:put($user || " " || $count)
The text file created by the previous query contains the following lines:
john 2 mike 2
The following query extracts comment elements from XML files and stores them in compressed sequence files. Before storing each comment, it deletes the id attribute and uses the value as the key in the sequence files.
import module "oxh:xmlf";

declare
   %seq:put("xml")
   %seq:compress("default", "block")
   %seq:file("comments")
function local:myPut($key as xs:string, $value as node()) external;
for $comment in xmlf:collection("mydata/comments*.xml", "comment")
let $id := $comment/@id
let $newComment :=
   copy $c := $comment
   modify delete node $c/@id
   return $c
return local:myPut($id, $newComment)
The next query reads the sequence files that the previous query created in an output directory named myoutput. The query automatically decompresses the sequence files.
import module "oxh:text";
import module "oxh:seq";

for $comment in seq:collection-xml("myoutput/comments*")/comment
let $id := $comment/seq:key()
where $id eq "12345"
return text:put-xml($comment)
The query creates a text file that contains the following line:
<comment id="12345" user="john" text="It is raining :( "/>