Sequence File Adapter

The sequence file adapter provides functions to read and write Hadoop sequence files. A sequence file is a Hadoop-specific file format composed of key-value pairs.

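The functions are described in the following topics:

  • Built-in Functions for Reading and Writing Sequence Files

  • Custom Functions for Reading Sequence Files

  • Custom Functions for Writing Sequence Files

  • Examples of Sequence File Adapter Functions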

See Also:

The Hadoop wiki for a description of Hadoop sequence files at

http://wiki.apache.org/hadoop/SequenceFile

Built-in Functions for Reading and Writing Sequence Files

To use the built-in functions in your query, you must import the sequence file module as follows:

import module "oxh:seq";

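The sequence file module contains the following functions:

  • seq:collection

  • seq:collection-xml

  • seq:collection-binxml

  • seq:put

  • seq:put-xml

  • seq:put-binxml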

For examples, see "Examples of Sequence File Adapter Functions."

seq:collection

Accesses a collection of sequence files in HDFS and returns the values as strings. The files may be split up and processed in parallel by multiple tasks.

Signature

declare %seq:collection("text") function 
   seq:collection($uris as xs:string*) as xs:string* external;

Parameters

$uris: The sequence file URIs. The values in the sequence files must be either org.apache.hadoop.io.Text or org.apache.hadoop.io.BytesWritable. For BytesWritable values, the bytes are converted to a string using a UTF-8 decoder.

Returns

One string for each value in each file.
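
As a minimal sketch of how seq:collection might be used, the following query reads string values from sequence files and writes the ones that contain a search term to text files. The path mydata/part* and the search term are hypothetical. The text:put function comes from the text adapter module, as in the examples at the end of this section.

```xquery
import module "oxh:seq";
import module "oxh:text";

(: "mydata/part*" is a placeholder for sequence files whose values are
   Text or BytesWritable; it is not a path defined by this document. :)
for $value in seq:collection("mydata/part*")
where contains($value, "lottery")
return
   text:put($value)
```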

seq:collection-xml

Accesses a collection of sequence files in HDFS, parses each value as XML, and returns it. Each file may be split up and processed in parallel by multiple tasks.

Signature

declare %seq:collection("xml") function 
   seq:collection-xml($uris as xs:string*) as document-node()* external;

Parameters

$uris: The sequence file URIs. The values in the sequence files must be either org.apache.hadoop.io.Text or org.apache.hadoop.io.BytesWritable. For BytesWritable values, the XML document encoding declaration is used, if it is available.

Returns

One XML document for each value in each file.

seq:collection-binxml

Accesses a collection of sequence files in HDFS, reads each value as binary XML, and returns it. Each file may be split up and processed in parallel by multiple tasks.

Signature

declare %seq:collection("binxml") function 
   seq:collection-binxml($uris as xs:string*) as document-node()* external;

Parameters

$uris: The sequence file URIs. The values in the sequence files must be org.apache.hadoop.io.BytesWritable. The bytes are decoded as binary XML.

Returns

One XML document for each value in each file.

Notes

You can use this function to read files that were created by seq:put-binxml in a previous query. See "seq:put-binxml."

See Also

Oracle XML Developer's Kit Programmer's Guide

seq:put

Writes either the string value or both the key and string value of a key-value pair to a sequence file in the output directory of the query.

This function writes the keys and values as org.apache.hadoop.io.Text.

When the function is called without the $key parameter, it writes the values as org.apache.hadoop.io.Text and sets the key class to org.apache.hadoop.io.NullWritable, because there are no key values.

Signature

declare %seq:put("text") function
   seq:put($key as xs:string, $value as xs:string) external;

declare %seq:put("text") function 
   seq:put($value as xs:string) external;

Parameters

$key: The key of a key-value pair

$value: The value of a key-value pair

Returns

empty-sequence()

Notes

The values are spread across one or more sequence files. The number of files created depends on how the query is distributed among tasks. Each file has a name that starts with part, such as part-m-00000. You specify the output directory when the query executes. See "Running a Query."
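
The two-argument form can be sketched as follows, assuming the sample files described under "Examples of Sequence File Adapter Functions." The query writes each comment's user name as the key and the comment text as the value. The choice of key and value is illustrative only.

```xquery
import module "oxh:seq";
import module "oxh:xmlf";

for $comment in xmlf:collection("mydata/comments*.xml", "comment")
(: string() makes the conversion from attribute node to xs:string explicit :)
let $user := string($comment/@user)
let $text := string($comment/@text)
return
   seq:put($user, $text)
```

Calling seq:put($text) instead would write only the values and set the key class to org.apache.hadoop.io.NullWritable, as described above.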

seq:put-xml

Writes either an XML value or a key and XML value to a sequence file in the output directory of the query.

This function writes the keys and values as org.apache.hadoop.io.Text.

When the function is called without the $key parameter, it writes the values as org.apache.hadoop.io.Text and sets the key class to org.apache.hadoop.io.NullWritable, because there are no key values.

Signature

declare %seq:put("xml") function
   seq:put-xml($key as xs:string, $xml as node()) external;

declare %seq:put("xml") function 
   seq:put-xml($xml as node()) external;

Parameters

$key: The key of a key-value pair

$xml: The XML value of a key-value pair

Returns

empty-sequence()

Notes

The values are spread across one or more sequence files. The number of files created depends on how the query is distributed among tasks. Each file has a name that starts with "part," such as part-m-00000. You specify the output directory when the query executes. See "Running a Query."

seq:put-binxml

Encodes an XML value as binary XML and writes the resulting bytes to a sequence file in the output directory of the query. The values are spread across one or more sequence files.

This function writes the keys as org.apache.hadoop.io.Text and the values as org.apache.hadoop.io.BytesWritable.

When the function is called without the $key parameter, it writes the values as org.apache.hadoop.io.BytesWritable and sets the key class to org.apache.hadoop.io.NullWritable, because there are no key values.

Signature

declare %seq:put("binxml") function
   seq:put-binxml($key as xs:string, $xml as node()) external;

declare %seq:put("binxml") function 
   seq:put-binxml($xml as node()) external;

Parameters

$key: The key of a key-value pair

$xml: The XML value of a key-value pair

Returns

empty-sequence()

Notes

The number of files created depends on how the query is distributed among tasks. Each file has a name that starts with part, such as part-m-00000. You specify the output directory when the query executes. See "Running a Query."

You can use the seq:collection-binxml function to read the files created by this function. See "seq:collection-binxml."

See Also

Oracle XML Developer's Kit Programmer's Guide
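
The following sketch shows the reading half of a binary XML round trip. It assumes that an earlier query called seq:put-binxml($comment) for each comment element of the sample files and that its output directory was named myoutput. Both are assumptions made for illustration.

```xquery
import module "oxh:seq";
import module "oxh:text";

(: Assumes an earlier query wrote comment elements with seq:put-binxml
   into a directory named myoutput (hypothetical). :)
for $comment in seq:collection-binxml("myoutput/part*")/comment
where $comment/like
return
   text:put($comment/@user || ": " || $comment/@text)
```

The query keeps only the comments that have at least one like element and writes one line of text for each of them.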

Custom Functions for Reading Sequence Files

You can use the following annotations to define functions that read collections of sequence files. These annotations provide additional functionality that is not available using the built-in functions.

Signature

Custom functions for reading sequence files must have one of the following signatures:

declare %seq:collection("text") [additional annotations] 
   function local:myFunctionName($uris as xs:string*) as xs:string* external;

declare %seq:collection(["xml"|"binxml"]) [additional annotations]
   function local:myFunctionName($uris as xs:string*) as document-node()* external;

Annotations

%seq:collection(["method"])

Declares the sequence file collection function, which reads sequence files. Required.

The optional method parameter can be one of the following values:

  • text: The values in the sequence files must be either org.apache.hadoop.io.Text or org.apache.hadoop.io.BytesWritable. Bytes are decoded using the character set specified by the %output:encoding annotation. They are returned as xs:string. Default.

  • xml: The values in the sequence files must be either org.apache.hadoop.io.Text or org.apache.hadoop.io.BytesWritable. The values are parsed as XML and returned by the function.

  • binxml: The values in the sequence files must be org.apache.hadoop.io.BytesWritable. The values are read as XDK binary XML and returned by the function. See Oracle XML Developer's Kit Programmer's Guide.

%output:encoding("charset")

Specifies the character encoding of the input files. The valid encodings are those supported by the JVM. UTF-8 is the default encoding.

See Also:

"Supported Encodings" in the Oracle Java SE documentation at

http://docs.oracle.com/javase/7/docs/technotes/guides/intl/encoding.doc.html
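
A custom reader that sets the encoding might look like the following sketch. The function name local:latin1, the charset ISO-8859-1, and the input path are hypothetical. Like Example 3 later in this section, the sketch uses the seq annotations without importing the oxh:seq module, and it assumes the output prefix is available in the same way.

```xquery
import module "oxh:text";

(: ISO-8859-1 is only an example; use the charset the files were written with. :)
declare
   %seq:collection("text")
   %output:encoding("ISO-8859-1")
function local:latin1($uris as xs:string*) as xs:string* external;

(: "mydata/latin1/part*" is a placeholder path. :)
for $value in local:latin1("mydata/latin1/part*")
return
   text:put($value)
```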

%seq:key("true" | "false")

Controls whether the key of a key-value pair is set as the document-uri of the returned value. Specify true to return the keys. The default setting is true when method is binxml or xml, and false when it is text.

Text functions with this annotation set to true must return text()* instead of xs:string* because atomic xs:string is not associated with a document.

When the keys are returned, you can obtain their string representations by using the seq:key function.

This example returns text instead of string values because %seq:key is set to true.

declare %seq:collection("text") %seq:key("true")
   function local:col($uris as xs:string*) as text()* external;

The next example uses the seq:key function to obtain the string representations of the keys:

for $value in local:col(...)
let $key := $value/seq:key()
return 
   .
   .
   .

%seq:split-max("split-size")

Specifies the maximum split size as either an integer or a string value. The split size controls how the input file is divided into tasks. Hadoop calculates the split size as max($split-min, min($split-max, $block-size)). Optional.

In a string value, you can append K, k, M, m, G, or g to the value to indicate kilobytes, megabytes, or gigabytes instead of bytes (the default unit). These qualifiers are not case sensitive. The following examples are equivalent:

%seq:split-max(1024)
%seq:split-max("1024")
%seq:split-max("1K")

%seq:split-min("split-size")

Specifies the minimum split size as either an integer or a string value. The split size controls how the input file is divided into tasks. Hadoop calculates the split size as max($split-min, min($split-max, $block-size)). Optional.

In a string value, you can append K, k, M, m, G, or g to the value to indicate kilobytes, megabytes, or gigabytes instead of bytes (the default unit). These qualifiers are not case sensitive. The following examples are equivalent:

%seq:split-min(1024)
%seq:split-min("1024")
%seq:split-min("1K")
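
The split size formula can be worked through in plain XQuery. The function local:split-size and the sample settings below are hypothetical; they restate the formula given above and are not part of the adapter.

```xquery
(: Plain XQuery; no adapter functions are needed for the arithmetic. :)
declare function local:split-size(
   $split-min  as xs:integer,
   $split-max  as xs:integer,
   $block-size as xs:integer
) as xs:integer
{
   max(($split-min, min(($split-max, $block-size))))
};

(: Hypothetical settings: 1-byte minimum, 32 MB maximum, 128 MB block size :)
local:split-size(1, 32 * 1024 * 1024, 128 * 1024 * 1024)
(: evaluates to 33554432, that is, 32 MB :)
```

Under these assumed settings, the maximum split size is the limiting value, so a single 128 MB block would be divided into roughly four splits.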

Custom Functions for Writing Sequence Files

You can use the following annotations to define functions that write collections of sequence files in HDFS.

Signature

Custom functions for writing sequence files must have one of the following signatures. You can omit the $key argument when you are not writing a key value.

declare %seq:put("text") [additional annotations] 
   function local:myFunctionName($key as xs:string, $value as xs:string) external;

declare %seq:put(["xml"|"binxml"]) [additional annotations] 
   function local:myFunctionName($key as xs:string, $xml as node()) external;

Annotations

%seq:put("method")

Declares the sequence file put function, which writes key-value pairs to a sequence file. Required.

If you use the $key argument in the signature, then the key is written as org.apache.hadoop.io.Text. If you omit the $key argument, then the key class is set to org.apache.hadoop.io.NullWritable.

Set the method parameter to text, xml, or binxml. The method determines the type used to write the value:

  • text: String written as org.apache.hadoop.io.Text

  • xml: XML written as org.apache.hadoop.io.Text

  • binxml: XML encoded as XDK binary XML and written as org.apache.hadoop.io.BytesWritable

%seq:compress("codec", "compressionType")

Specifies the compression format used on the output. The default is no compression. Optional.

The codec parameter identifies a compression codec. The first registered compression codec that matches the value is used. The value matches a codec if it equals one of the following:

  1. The fully qualified class name of the codec

  2. The unqualified class name of the codec

  3. The prefix of the unqualified class name before Codec (case insensitive)

Set the compressionType parameter to one of these values:

  • block: Keys and values are collected in groups and compressed together. Block compression is generally more compact, because the compression algorithm can take advantage of similarities among different values.

  • record: Only the values in the sequence file are compressed.

All of these examples use the default codec and block compression:

%seq:compress("org.apache.hadoop.io.compress.DefaultCodec", "block")
%seq:compress("DefaultCodec", "block")
%seq:compress("default", "block")

%seq:file("name")

Specifies the output file name prefix. The default prefix is part.
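
As a sketch of how these annotations combine, the following custom function omits the $key argument, compresses each record, and changes the file name prefix. The function name local:putUser and the prefix users are hypothetical. The value gzip is intended to match org.apache.hadoop.io.compress.GzipCodec by the prefix rule, which assumes that codec is registered on the cluster.

```xquery
import module "oxh:xmlf";

(: "gzip" relies on prefix matching against GzipCodec; this assumes the
   codec is registered. "users" is an arbitrary file name prefix. :)
declare
   %seq:put("text")
   %seq:compress("gzip", "record")
   %seq:file("users")
function local:putUser($value as xs:string) external;

for $comment in xmlf:collection("mydata/comments*.xml", "comment")
return
   local:putUser(string($comment/@user))
```

Because the function has no $key argument, the key class is set to org.apache.hadoop.io.NullWritable, and the output file names should begin with users instead of part.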

%output:parameter

A standard XQuery serialization parameter for the output method (text or XML) specified in %seq:put. See "Serialization Annotations."

See Also:

The Hadoop Wiki SequenceFile topic at

http://wiki.apache.org/hadoop/SequenceFile

"The Influence of Serialization Parameters" sections for XML and text output methods in XSLT and XQuery Serialization 3.0 at

http://www.w3.org/TR/xslt-xquery-serialization-30/

Examples of Sequence File Adapter Functions

These examples query three XML files in HDFS with the following contents. Each XML file contains comments made by users on a specific day. Each comment can have zero or more "likes" from other users.

mydata/comments1.xml
 
<comments date="2013-12-30">
   <comment id="12345" user="john" text="It is raining :( "/>
   <comment id="56789" user="kelly" text="I won the lottery!">
      <like user="john"/>
      <like user="mike"/>
   </comment>
</comments>
 
mydata/comments2.xml
 
<comments date="2013-12-31">
   <comment id="54321" user="mike" text="Happy New Year!">
      <like user="laura"/>
   </comment>
</comments>
 
mydata/comments3.xml
  
<comments date="2014-01-01">
   <comment id="87654" user="mike" text="I don't feel so good."/>
   <comment id="23456" user="john" text="What a beautiful day!">
      <like user="kelly"/>
      <like user="phil"/>
   </comment>
</comments>

Example 1

The following query stores the comment elements in sequence files.

import module "oxh:seq";
import module "oxh:xmlf";
 
for $comment in xmlf:collection("mydata/comments*.xml", "comment")
return 
   seq:put-xml($comment)

Example 2

The next query reads the sequence files generated by the previous query, which are stored in an output directory named myoutput. The query then writes the names of users who made multiple comments to a text file.

import module "oxh:seq";
import module "oxh:text";

for $comment in seq:collection-xml("myoutput/part*")/comment
let $user := $comment/@user
group by $user
let $count := count($comment)
where $count gt 1
return
   text:put($user || " " || $count)

The text file created by the previous query contains the following lines:

john 2
mike 2

See "XML File Adapter."

Example 3   

The following query extracts comment elements from XML files and stores them in compressed sequence files. Before storing each comment, it deletes the id attribute and uses the value as the key in the sequence files.

import module "oxh:xmlf";

declare 
   %seq:put("xml")
   %seq:compress("default", "block") 
   %seq:file("comments")
function local:myPut($key as xs:string, $value as node()) external;

for $comment in xmlf:collection("mydata/comments*.xml", "comment")
let $id := $comment/@id
let $newComment := 
   copy $c := $comment 
   modify delete node $c/@id
   return $c
return
   local:myPut($id, $newComment)

Example 4

The next query reads the sequence files that the previous query created in an output directory named myoutput. The query automatically decompresses the sequence files.

import module "oxh:text";
import module "oxh:seq";

for $comment in seq:collection-xml("myoutput/comments*")/comment
let $id := $comment/seq:key()
where $id eq "12345"
return 
   text:put-xml($comment)
 

The query creates a text file that contains the following line:

<comment user="john" text="It is raining :( "/>