The sequence file adapter provides functions to read and write Hadoop sequence files. A sequence file is a Hadoop-specific file format composed of key-value pairs.
The functions are described in the following topics:
See Also:
The Hadoop wiki for a description of Hadoop sequence files at http://wiki.apache.org/hadoop/SequenceFile

To use the built-in functions in your query, you must import the sequence file module as follows:
import module "oxh:seq";
The sequence file module contains the following functions:
For examples, see "Examples of Sequence File Adapter Functions."
Accesses a collection of sequence files in HDFS and returns the values as strings. The files may be split up and processed in parallel by multiple tasks.
declare %seq:collection("text") function seq:collection($uris as xs:string*) as xs:string* external;
The sequence file URIs. The values in the sequence files must be either org.apache.hadoop.io.Text or org.apache.hadoop.io.BytesWritable. For BytesWritable values, the bytes are converted to a string using a UTF-8 decoder.
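As a sketch, a query of the following form could copy the non-empty string values out of a set of sequence files into text files (the mydata path is hypothetical):

```xquery
import module "oxh:seq";
import module "oxh:text";

(: Read each sequence file value as a string and keep the non-empty ones :)
for $value in seq:collection("mydata/part*")
where string-length($value) gt 0
return text:put($value)
```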
Accesses a collection of sequence files in HDFS, parses each value as XML, and returns it. Each file may be split up and processed in parallel by multiple tasks.
declare %seq:collection("xml") function seq:collection-xml($uris as xs:string*) as document-node()* external;
The sequence file URIs. The values in the sequence files must be either org.apache.hadoop.io.Text or org.apache.hadoop.io.BytesWritable. For BytesWritable values, the XML document encoding declaration is used, if it is available.
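For instance, assuming sequence files whose values are comment documents like those used in the examples later in this section, a sketch:

```xquery
import module "oxh:seq";
import module "oxh:text";

(: Parse each value as XML and list the user attribute of each comment :)
for $doc in seq:collection-xml("mydata/part*")
return text:put($doc/comment/@user)
```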
Accesses a collection of sequence files in HDFS, reads each value as binary XML, and returns it. Each file may be split up and processed in parallel by multiple tasks.
declare %seq:collection("binxml") function seq:collection-binxml($uris as xs:string*) as document-node()* external;
The sequence file URIs. The values in the sequence files must be org.apache.hadoop.io.BytesWritable. The bytes are decoded as binary XML.

You can use this function to read files that were created by seq:put-binxml in a previous query. See "seq:put-binxml."
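A minimal sketch of such a round trip (the myoutput directory is hypothetical):

```xquery
import module "oxh:seq";
import module "oxh:text";

(: Decode binary XML values and re-serialize them as text :)
for $doc in seq:collection-binxml("myoutput/part*")
return text:put-xml($doc/*)
```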
Writes the string value of a key-value pair to a sequence file in the output directory of the query. The values are spread across one or more sequence files.
This function writes the values as org.apache.hadoop.io.Text, and sets the key class to org.apache.hadoop.io.NullWritable because there are no key values.
The number of files created depends on how the query is distributed among tasks. Each file has a name that starts with "part," such as part-m-00000. You specify the output directory when the query executes. See "Running a Query."
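A sketch of such a value-only write, assuming the single-argument signature seq:put($value as xs:string) and that text:collection from the text file module returns one string per line (the input path is hypothetical):

```xquery
import module "oxh:seq";
import module "oxh:text";

(: Copy each line of the text files into sequence files; no key is written :)
for $line in text:collection("mydata/*.txt")
return seq:put($line)
```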
Writes a key and string value to a sequence file in the output directory of the query. The values are spread across one or more sequence files.
This function writes the keys and values as org.apache.hadoop.io.Text.
declare %seq:put("text") function seq:put($key as xs:string, $value as xs:string) external;
Writes an XML value to a sequence file in the output directory of the query. The values are spread across one or more sequence files.
This function writes the values as org.apache.hadoop.io.Text, and sets the key class to org.apache.hadoop.io.NullWritable because there are no key values.
The number of files created depends on how the query is distributed among tasks. Each file has a name that starts with "part," such as part-m-00000. You specify the output directory when the query executes. See "Running a Query."
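A sketch, assuming the single-argument signature seq:put-xml($xml as node()) (the input path is hypothetical):

```xquery
import module "oxh:seq";
import module "oxh:xmlf";

(: Store only the comments that received at least one like :)
for $comment in xmlf:collection("mydata/comments*.xml", "comment")
where exists($comment/like)
return seq:put-xml($comment)
```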
Writes a key and XML value to a sequence file in the output directory of the query. The values are spread across one or more sequence files.
This function writes the keys and values as org.apache.hadoop.io.Text.
The number of files created depends on how the query is distributed among tasks. Each file has a name that starts with "part," such as part-m-00000. You specify the output directory when the query executes. See "Running a Query."
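A sketch, assuming a two-argument signature parallel to seq:put, i.e. seq:put-xml($key as xs:string, $xml as node()):

```xquery
import module "oxh:seq";
import module "oxh:xmlf";

(: Key each stored comment element by its id attribute :)
for $comment in xmlf:collection("mydata/comments*.xml", "comment")
return seq:put-xml($comment/@id, $comment)
```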
Encodes an XML value as binary XML and writes the resulting bytes to a sequence file in the output directory of the query. The values are spread across one or more sequence files.
This function writes the values as org.apache.hadoop.io.BytesWritable, and sets the key class to org.apache.hadoop.io.NullWritable because there are no key values.
The number of files created depends on how the query is distributed among tasks. Each file has a name that starts with "part," such as part-m-00000. You specify the output directory when the query executes. See "Running a Query."
You can use the seq:collection-binxml function to read the files created by this function. See "seq:collection-binxml."
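A sketch, assuming the value-only signature seq:put-binxml($xml as node()):

```xquery
import module "oxh:seq";
import module "oxh:xmlf";

(: Encode each comment as binary XML before writing :)
for $comment in xmlf:collection("mydata/comments*.xml", "comment")
return seq:put-binxml($comment)
```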
Encodes an XML value as binary XML and writes the resulting bytes to a sequence file in the output directory of the query. The values are spread across one or more sequence files.
This function writes the keys as org.apache.hadoop.io.Text and the values as org.apache.hadoop.io.BytesWritable.
declare %seq:put("binxml") function seq:put-binxml($key as xs:string, $xml as node()) external;
The number of files created depends on how the query is distributed among tasks. Each file has a name that starts with "part," such as part-m-00000. You specify the output directory when the query executes. See "Running a Query."
You can use the seq:collection-binxml function to read the files created by this function. See "seq:collection-binxml."
This example queries three XML files in HDFS with the following contents. Each XML file contains comments made by users on a specific day. Each comment can have zero or more "likes" from other users.
mydata/comments1.xml

<comments date="2013-12-30">
  <comment id="12345" user="john" text="It is raining :( "/>
  <comment id="56789" user="kelly" text="I won the lottery!">
    <like user="john"/>
    <like user="mike"/>
  </comment>
</comments>

mydata/comments2.xml

<comments date="2013-12-31">
  <comment id="54321" user="mike" text="Happy New Year!">
    <like user="laura"/>
  </comment>
</comments>

mydata/comments3.xml

<comments date="2014-01-01">
  <comment id="87654" user="mike" text="I don't feel so good."/>
  <comment id="23456" user="john" text="What a beautiful day!">
    <like user="kelly"/>
    <like user="phil"/>
  </comment>
</comments>
The following query stores the comment elements in sequence files.
import module "oxh:seq"; import module "oxh:xmlf"; for $comment in xmlf:collection("mydata/comments*.xml", "comment") return seq:put-xml($comment)
The next query reads the sequence files generated by the previous query, which are stored in an output directory named myoutput. The query then writes the names of users who made multiple comments to a text file.
import module "oxh:seq"; import module "oxh:text"; for $comment in seq:collection-xml("myoutput/part*")/comment let $user := $comment/@user group by $user let $count := count($comment) where $count gt 1 return text:put($user || " " || $count)
The text file created by the previous query contains the following lines:

john 2
mike 2
You can use the following annotations to define functions that read collections of sequence files. These annotations provide additional functionality that is not available using the built-in functions.
Custom functions for reading sequence files must have one of the following signatures:
declare %seq:collection("text") [additional annotations] function local:myFunctionName($uris as xs:string*) as xs:string* external; declare %seq:collection(["xml"|"binxml"]) [additional annotations] function local:myFunctionName($uris as xs:string*) as document-node()* external;
Declares the sequence file collection function, which reads sequence files. Required.
The optional method parameter can be one of the following values:
- text: The values in the sequence files must be either org.apache.hadoop.io.Text or org.apache.hadoop.io.BytesWritable (see Footnote 1). They are returned as xs:string. Default.

- xml: The values in the sequence files must be either org.apache.hadoop.io.Text or org.apache.hadoop.io.BytesWritable. The values are parsed as XML and returned by the function.

- binxml: The values in the sequence files must be org.apache.hadoop.io.BytesWritable. The values are read as XDK binary XML and returned by the function. See the Oracle XML Developer’s Kit Programmer’s Guide.
Specifies the character encoding of the input files. The valid encodings are those supported by the JDK. UTF-8 is the default encoding.
See Also:
"Supported Encodings" in the Oracle Java SE documentation athttp://docs.oracle.com/javase/7/docs/technotes/guides/intl/encoding.doc.html
Controls whether the key of a key-value pair is set as the document-uri of the returned value. Specify true to return the keys. The default setting is true when method is binxml or xml, and false when it is text.
Text functions with this annotation set to true must return text()* instead of xs:string*, because an atomic xs:string is not associated with a document.
When the keys are returned, you can obtain their string representations by using the seq:key function.
This example returns text instead of string values because %seq:key is set to true.
declare %seq:collection("text") %seq:key("true") function local:col($uris as xs:string*) as text()* external;
The next example uses the seq:key function to obtain the string representations of the keys:
for $value in local:col(...)
let $key := $value/seq:key()
return . . .
Specifies the maximum split size as either an integer or a string value. The split size controls how the input file is divided into tasks. Hadoop calculates the split size as max($split-min, min($split-max, $block-size)). Optional.
In a string value, you can append K, k, M, m, G, or g to the value to indicate kilobytes, megabytes, or gigabytes instead of bytes (the default unit). These qualifiers are not case sensitive. The following examples are equivalent:

%seq:split-max(1024)
%seq:split-max("1024")
%seq:split-max("1K")
Specifies the minimum split size as either an integer or a string value. The split size controls how the input file is divided into tasks. Hadoop calculates the split size as max($split-min, min($split-max, $block-size)). Optional.
In a string value, you can append K, k, M, m, G, or g to the value to indicate kilobytes, megabytes, or gigabytes instead of bytes (the default unit). These qualifiers are not case sensitive. The following examples are equivalent:

%seq:split-min(1024)
%seq:split-min("1024")
%seq:split-min("1K")
You can use the following annotations to define functions that write collections of sequence files in HDFS.
Custom functions for writing sequence files must have one of the following signatures. You can omit the $key argument when you are not writing a key value.
declare %seq:put("text") [additional annotations] function local:myFunctionName($key as xs:string, $value as xs:string) external; declare %seq:put(["xml"|"binxml"]) [additional annotations] function local:myFunctionName($key as xs:string, $xml as node()) external;
Declares the sequence file put function, which writes key-value pairs to a sequence file. Required.
If you use the $key argument in the signature, then the key is written as org.apache.hadoop.io.Text. If you omit the $key argument, then the key class is set to org.apache.hadoop.io.NullWritable.
Set the method parameter to text, xml, or binxml. The method determines the type used to write the value:
- text: String written as org.apache.hadoop.io.Text

- xml: XML written as org.apache.hadoop.io.Text

- binxml: XML encoded as XDK binary XML and written as org.apache.hadoop.io.BytesWritable
Specifies the compression format used on the output. The default is no compression. Optional.
The codec parameter identifies a compression codec. The first registered compression codec that matches the value is used. The value matches a codec if it equals one of the following:
- The fully qualified class name of the codec

- The unqualified class name of the codec

- The prefix of the unqualified class name before Codec (case insensitive)
Set the compressionType parameter to one of these values:
- block: Keys and values are collected in groups and compressed together. Block compression is generally more compact, because the compression algorithm can take advantage of similarities among different values.

- record: Only the values in the sequence file are compressed.
All of these examples use the default codec and block compression:
%seq:compress("org.apache.hadoop.io.compress.DefaultCodec", "block")
%seq:compress("DefaultCodec", "block")
%seq:compress("default", "block")
Specifies the output file name prefix. The default prefix is part.
A standard XQuery serialization parameter for the output method (text or XML) specified in %seq:put. See "Serialization Annotations."
See Also:
The Hadoop Wiki SequenceFile topic at http://wiki.apache.org/hadoop/SequenceFile
"The Influence of Serialization Parameters" sections for XML and text output methods in XSLT and XQuery Serialization 3.0 at
Examples of Sequence File Adapter Functions
The following query extracts comment elements from XML files and stores them in compressed sequence files. Before storing each comment, it deletes the id attribute and uses the value as the key in the sequence files.
import module "oxh:xmlf"; declare %seq:put("xml") %seq:compress("default", "block") %seq:file("comments") function local:myPut($key as xs:string, $value as node()) external;
for $comment in xmlf:collection("mydata/comments*.xml", "comment")
let $id := $comment/@id
let $newComment :=
  copy $c := $comment
  modify delete node $c/@id
  return $c
return local:myPut($id, $newComment)
The next query reads the sequence files that the previous query created in an output directory named myoutput. The query automatically decompresses the sequence files.
import module "oxh:text"; import module "oxh:seq"; for $comment in seq:collection-xml("myoutput/comments*")/comment let $id := $comment/seq:key() where $id eq "12345" return text:put-xml($comment)
The previous query creates a text file that contains the following line:
<comment id="12345" user="john" text="It is raining :( "/>
Footnote Legend
Footnote 1: Bytes are decoded using the character set specified by the %output:encoding annotation.