The sequence file adapter provides functions to read and write Hadoop sequence files. A sequence file is a Hadoop-specific file format composed of key-value pairs.
The functions are described in the following topics:
See Also:
The Hadoop wiki for a description of Hadoop sequence files at http://wiki.apache.org/hadoop/SequenceFile

To use the built-in functions in your query, you must import the sequence file module as follows:
import module "oxh:seq";
The sequence file module contains the following functions:
For examples, see "Examples of Sequence File Adapter Functions."
Accesses a collection of sequence files in HDFS and returns the values as strings. The files may be split up and processed in parallel by multiple tasks.
declare %seq:collection("text") function seq:collection($uris as xs:string*) as xs:string* external;
$uris: The sequence file URIs. The values in the sequence files must be either org.apache.hadoop.io.Text or org.apache.hadoop.io.BytesWritable. For BytesWritable values, the bytes are converted to a string using a UTF-8 decoder.

Returns: One string for each value in each file.
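As a minimal sketch, a query like the following reads every value from a set of sequence files and copies each one to a text file. The input path mydata/seqfiles is hypothetical:

import module "oxh:seq";
import module "oxh:text";

(: Read each value as a string and write it out unchanged :)
for $value in seq:collection("mydata/seqfiles/part*")
return text:put($value)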
Accesses a collection of sequence files in HDFS, parses each value as XML, and returns it. Each file may be split up and processed in parallel by multiple tasks.
declare %seq:collection("xml") function seq:collection-xml($uris as xs:string*) as document-node()* external;
$uris: The sequence file URIs. The values in the sequence files must be either org.apache.hadoop.io.Text or org.apache.hadoop.io.BytesWritable. For BytesWritable values, the XML document encoding declaration is used, if it is available.

Returns: One XML document for each value in each file.
Accesses a collection of sequence files in HDFS, reads each value as binary XML, and returns it. Each file may be split up and processed in parallel by multiple tasks.
declare %seq:collection("binxml") function seq:collection-binxml($uris as xs:string*) as document-node()* external;
$uris: The sequence file URIs. The values in the sequence files must be org.apache.hadoop.io.BytesWritable. The bytes are decoded as binary XML.

Returns: One XML document for each value in each file.
You can use this function to read files that were created by seq:put-binxml in a previous query. See "seq:put-binxml."
Writes either the string value or both the key and string value of a key-value pair to a sequence file in the output directory of the query.
This function writes the keys and values as org.apache.hadoop.io.Text.

When the function is called without the $key parameter, it writes the values as org.apache.hadoop.io.Text and sets the key class to org.apache.hadoop.io.NullWritable, because there are no key values.
declare %seq:put("text") function seq:put($key as xs:string, $value as xs:string) external;

declare %seq:put("text") function seq:put($value as xs:string) external;
$key: The key of a key-value pair

$value: The value of a key-value pair

Returns: empty-sequence()
The values are spread across one or more sequence files. The number of files created depends on how the query is distributed among tasks. Each file has a name that starts with part, such as part-m-00000. You specify the output directory when the query executes. See "Running Queries."
Writes either an XML value or a key and XML value to a sequence file in the output directory of the query.
This function writes the keys and values as org.apache.hadoop.io.Text.

When the function is called without the $key parameter, it writes the values as org.apache.hadoop.io.Text and sets the key class to org.apache.hadoop.io.NullWritable, because there are no key values.
declare %seq:put("xml") function seq:put-xml($key as xs:string, $xml as node()) external;

declare %seq:put("xml") function seq:put-xml($xml as node()) external;
$key: The key of a key-value pair

$xml: The value of a key-value pair

Returns: empty-sequence()
The values are spread across one or more sequence files. The number of files created depends on how the query is distributed among tasks. Each file has a name that starts with "part," such as part-m-00000. You specify the output directory when the query executes. See "Running Queries."
Encodes an XML value as binary XML and writes the resulting bytes to a sequence file in the output directory of the query. The values are spread across one or more sequence files.
This function writes the keys as org.apache.hadoop.io.Text and the values as org.apache.hadoop.io.BytesWritable.

When the function is called without the $key parameter, it writes the values as org.apache.hadoop.io.BytesWritable and sets the key class to org.apache.hadoop.io.NullWritable, because there are no key values.
declare %seq:put("binxml") function seq:put-binxml($key as xs:string, $xml as node()) external;

declare %seq:put("binxml") function seq:put-binxml($xml as node()) external;
$key: The key of a key-value pair

$xml: The value of a key-value pair

Returns: empty-sequence()
The number of files created depends on how the query is distributed among tasks. Each file has a name that starts with part, such as part-m-00000. You specify the output directory when the query executes. See "Running Queries."
You can use the seq:collection-binxml function to read the files created by this function. See "seq:collection-binxml."
You can use the following annotations to define functions that read collections of sequence files. These annotations provide additional functionality that is not available using the built-in functions.
Custom functions for reading sequence files must have one of the following signatures:
declare %seq:collection("text") [additional annotations] function local:myFunctionName($uris as xs:string*) as xs:string* external;

declare %seq:collection(["xml"|"binxml"]) [additional annotations] function local:myFunctionName($uris as xs:string*) as document-node()* external;
Declares the sequence file collection function, which reads sequence files. Required.
The optional method parameter can be one of the following values:

text: The values in the sequence files must be either org.apache.hadoop.io.Text or org.apache.hadoop.io.BytesWritable. Bytes are decoded using the character set specified by the %output:encoding annotation and returned as xs:string. Default.

xml: The values in the sequence files must be either org.apache.hadoop.io.Text or org.apache.hadoop.io.BytesWritable. The values are parsed as XML and returned by the function.

binxml: The values in the sequence files must be org.apache.hadoop.io.BytesWritable. The values are read as XDK binary XML and returned by the function. See Oracle XML Developer's Kit Programmer's Guide.
Specifies the character encoding of the input files. The valid encodings are those supported by the JVM. UTF-8 is the default encoding.
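For example, a custom function that reads Latin-1 encoded values might be declared as follows. This is a sketch; the function name is arbitrary:

(: Decode the value bytes as ISO-8859-1 instead of the UTF-8 default :)
declare
   %seq:collection("text")
   %output:encoding("ISO-8859-1")
function local:latin1($uris as xs:string*) as xs:string* external;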
See Also:
"Supported Encodings" in the Oracle Java SE documentation at http://docs.oracle.com/javase/7/docs/technotes/guides/intl/encoding.doc.html
Controls whether the key of a key-value pair is set as the document-uri of the returned value. Specify true to return the keys. The default setting is true when method is binxml or xml, and false when it is text.
Text functions with this annotation set to true must return text()* instead of xs:string*, because an atomic xs:string is not associated with a document.

When the keys are returned, you can obtain their string representations by using the seq:key function.
This example returns text instead of string values because %seq:key is set to true.
declare %seq:collection("text") %seq:key("true") function local:col($uris as xs:string*) as text()* external;
The next example uses the seq:key function to obtain the string representations of the keys:
for $value in local:col(...)
let $key := $value/seq:key()
return . . .
Specifies the maximum split size as either an integer or a string value. The split size controls how the input file is divided into tasks. Hadoop calculates the split size as max($split-min, min($split-max, $block-size)). Optional.

In a string value, you can append K, k, M, m, G, or g to the value to indicate kilobytes, megabytes, or gigabytes instead of bytes (the default unit). These qualifiers are not case sensitive. The following examples are equivalent:

%seq:split-max(1024)
%seq:split-max("1024")
%seq:split-max("1K")
Specifies the minimum split size as either an integer or a string value. The split size controls how the input file is divided into tasks. Hadoop calculates the split size as max($split-min, min($split-max, $block-size)). Optional.

In a string value, you can append K, k, M, m, G, or g to the value to indicate kilobytes, megabytes, or gigabytes instead of bytes (the default unit). These qualifiers are not case sensitive. The following examples are equivalent:

%seq:split-min(1024)
%seq:split-min("1024")
%seq:split-min("1K")
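The split annotations can be combined in a single custom collection declaration. As a sketch (the function name and sizes here are arbitrary):

(: Constrain split sizes so each task processes between 128 MB and 512 MB :)
declare
   %seq:collection("xml")
   %seq:split-min("128M")
   %seq:split-max("512M")
function local:mySplitCol($uris as xs:string*) as document-node()* external;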
You can use the following annotations to define functions that write collections of sequence files in HDFS.
Custom functions for writing sequence files must have one of the following signatures. You can omit the $key argument when you are not writing a key value.
declare %seq:put("text") [additional annotations] function local:myFunctionName($key as xs:string, $value as xs:string) external;

declare %seq:put(["xml"|"binxml"]) [additional annotations] function local:myFunctionName($key as xs:string, $xml as node()) external;
Declares the sequence file put function, which writes key-value pairs to a sequence file. Required.
If you use the $key argument in the signature, then the key is written as org.apache.hadoop.io.Text. If you omit the $key argument, then the key class is set to org.apache.hadoop.io.NullWritable.
Set the method parameter to text, xml, or binxml. The method determines the type used to write the value:

text: String written as org.apache.hadoop.io.Text

xml: XML written as org.apache.hadoop.io.Text

binxml: XML encoded as XDK binary XML and written as org.apache.hadoop.io.BytesWritable
Specifies the compression format used on the output. The default is no compression. Optional.
The codec parameter identifies a compression codec. The first registered compression codec that matches the value is used. The value matches a codec if it equals one of the following:
The fully qualified class name of the codec

The unqualified class name of the codec

The prefix of the unqualified class name before Codec (case insensitive)
Set the compressionType parameter to one of these values:
block: Keys and values are collected in groups and compressed together. Block compression is generally more compact, because the compression algorithm can take advantage of similarities among different values.

record: Only the values in the sequence file are compressed.
All of these examples use the default codec and block compression:
%seq:compress("org.apache.hadoop.io.compress.DefaultCodec", "block")
%seq:compress("DefaultCodec", "block")
%seq:compress("default", "block")
Specifies the output file name prefix. The default prefix is part.
A standard XQuery serialization parameter for the output method (text or XML) specified in %seq:put. See "Serialization Annotations."
See Also:
The Hadoop Wiki SequenceFile topic at http://wiki.apache.org/hadoop/SequenceFile

"The Influence of Serialization Parameters" sections for XML and text output methods in XSLT and XQuery Serialization 3.0
These examples query three XML files in HDFS with the following contents. Each XML file contains comments made by users on a specific day. Each comment can have zero or more "likes" from other users.
mydata/comments1.xml

<comments date="2013-12-30">
   <comment id="12345" user="john" text="It is raining :( "/>
   <comment id="56789" user="kelly" text="I won the lottery!">
      <like user="john"/>
      <like user="mike"/>
   </comment>
</comments>

mydata/comments2.xml

<comments date="2013-12-31">
   <comment id="54321" user="mike" text="Happy New Year!">
      <like user="laura"/>
   </comment>
</comments>

mydata/comments3.xml

<comments date="2014-01-01">
   <comment id="87654" user="mike" text="I don't feel so good."/>
   <comment id="23456" user="john" text="What a beautiful day!">
      <like user="kelly"/>
      <like user="phil"/>
   </comment>
</comments>
The following query stores the comment elements in sequence files.
import module "oxh:seq";
import module "oxh:xmlf";

for $comment in xmlf:collection("mydata/comments*.xml", "comment")
return seq:put-xml($comment)
The next query reads the sequence files generated by the previous query, which are stored in an output directory named myoutput. The query then writes the names of users who made multiple comments to a text file.
import module "oxh:seq";
import module "oxh:text";

for $comment in seq:collection-xml("myoutput/part*")/comment
let $user := $comment/@user
group by $user
let $count := count($comment)
where $count gt 1
return text:put($user || " " || $count)
The text file created by the previous query contains the following lines:
john 2 mike 2
The following query extracts comment elements from XML files and stores them in compressed sequence files. Before storing each comment, it deletes the id attribute and uses the value as the key in the sequence files.
import module "oxh:xmlf";

declare
   %seq:put("xml")
   %seq:compress("default", "block")
   %seq:file("comments")
function local:myPut($key as xs:string, $value as node()) external;
for $comment in xmlf:collection("mydata/comments*.xml", "comment")
let $id := $comment/@id
let $newComment :=
   copy $c := $comment
   modify delete node $c/@id
   return $c
return local:myPut($id, $newComment)
The next query reads the sequence files that the previous query created in an output directory named myoutput. The query automatically decompresses the sequence files.
import module "oxh:text";
import module "oxh:seq";

for $comment in seq:collection-xml("myoutput/comments*")/comment
let $id := $comment/seq:key()
where $id eq "12345"
return text:put-xml($comment)
The query creates a text file that contains the following line:
<comment id="12345" user="john" text="It is raining :( "/>