XML File Adapter

The XML file adapter provides access to XML files stored in HDFS. The adapter optionally splits individual XML files so that a single file can be processed in parallel by multiple tasks.

This adapter is described in the following topics:

  • Built-in Functions for Reading XML Files

  • Examples of XML File Adapter Functions

  • Custom Functions for Reading XML Files

Built-in Functions for Reading XML Files

To use the built-in functions in your query, you must import the XML file module as follows:

import module "oxh:xmlf";

The XML file module contains the following functions:

  • xmlf:collection (single-argument form, in which each file is parsed by a single task)

  • xmlf:collection (two-argument form, in which files can be split across multiple tasks)

See "Examples of XML File Adapter Functions."

xmlf:collection

Accesses a collection of XML documents in HDFS. Multiple files can be processed concurrently, but each individual file is parsed by a single task.

Note:

HDFS does not perform well when data is stored in many small files. For large data sets with many small XML documents, use Hadoop sequence files and the Sequence File Adapter.

Signature

declare %xmlf:collection function
   xmlf:collection($uris as xs:string*) as document-node()* external;

Parameters

$uris

The XML file URIs

Returns

One XML document for each file

xmlf:collection

Accesses a collection of XML documents in HDFS. The files might be split and processed by multiple tasks simultaneously. The function returns only elements that match a specified name. This enables very large XML files to be processed efficiently.

This function only supports XML files that meet certain requirements. See "Restrictions on Splitting XML Files."

Signature

declare %xmlf:collection function
   xmlf:collection($uris as xs:string*, $names as xs:anyAtomicType+) as element()* external;

Parameters

$uris

The XML file URIs

$names

The names of the elements to be returned by the function. The names can be either strings or QNames. For QNames, the XML parser uses the namespace binding implied by the QName prefix and namespace.

Returns

Each element that matches one of the names specified by the $names argument
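
Because QNames carry a namespace binding, they let the function match namespaced elements. The following sketch is illustrative only: the file names, the eg prefix, and the http://example.org namespace are assumptions, not part of the adapter.

```xquery
import module "oxh:xmlf";
import module "oxh:text";

(: Bind the prefix used to construct the QName below :)
declare namespace eg = "http://example.org";

(: xs:QName("eg:foo") resolves against the declaration above,
   so the parser matches foo elements in http://example.org :)
for $foo in xmlf:collection("mydata/ns*.xml", xs:QName("eg:foo"))
return text:put($foo/@id)
```

Passing the string "foo" instead would match foo elements by local name according to the parser's in-scope bindings, so the QName form is the safer choice for namespaced input.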

Examples of XML File Adapter Functions

This example queries three XML files in HDFS with the following contents. Each XML file contains comments made by users on a specific day. Each comment can have zero or more "likes" from other users.

mydata/comments1.xml
 
<comments date="2013-12-30">
   <comment id="12345" user="john" text="It is raining :( "/>
   <comment id="56789" user="kelly" text="I won the lottery!">
      <like user="john"/>
      <like user="mike"/>
   </comment>
</comments>
 
mydata/comments2.xml
 
<comments date="2013-12-31">
   <comment id="54321" user="mike" text="Happy New Year!">
      <like user="laura"/>
   </comment>
</comments>
 
mydata/comments3.xml
  
<comments date="2014-01-01">
   <comment id="87654" user="mike" text="I don't feel so good."/>
   <comment id="23456" user="john" text="What a beautiful day!">
      <like user="kelly"/>
      <like user="phil"/>
   </comment>
</comments>
 

This query writes the number of comments made each year to a text file. No element names are passed to xmlf:collection, and so it returns three documents, one for each file. Each file is processed serially by a single task.

import module "oxh:xmlf";
import module "oxh:text";

for $comments in xmlf:collection("mydata/comments*.xml")/comments
let $date := xs:date($comments/@date)
group by $year := fn:year-from-date($date)
return 
   text:put($year || ", " || fn:count($comments/comment))

The query creates text files that contain the following lines:

2013, 3
2014, 2

The next example writes the number of comments and the average number of likes for each user. Each input file is split, so that it can be processed in parallel by multiple tasks. The xmlf:collection function returns five elements, one for each comment.

import module "oxh:xmlf";
import module "oxh:text";

for $comment in xmlf:collection("mydata/comments*.xml", "comment")
let $likeCt := fn:count($comment/like)
group by $user := $comment/@user
return 
   text:put($user || ", " || fn:count($comment) || ", " || fn:avg($likeCt))
 

This query creates text files that contain the following lines:

john, 2, 1
kelly, 1, 2
mike, 2, 0.5

Custom Functions for Reading XML Files

You can use the following annotations to define functions that read collections of XML files in HDFS. These annotations provide additional functionality that is not available using the built-in functions.

Signature

Custom functions for reading XML files must have one of the following signatures:

declare %xmlf:collection [additional annotations]
   function local:myFunctionName($uris as xs:string*) as node()* external;

declare %xmlf:collection [additional annotations]
   function local:myFunctionName($uris as xs:string*, $names as xs:anyAtomicType+) as element()* external;

Annotations

%xmlf:collection

Declares the collection function. This annotation does not accept parameters. Required.

%xmlf:split("element-name1" [,... "element-nameN"])

Specifies the element names used for parallel XML parsing. This annotation can be used instead of the $names argument.

When this annotation is specified, only the single argument version of the function is allowed. This enables the element names to be specified statically, so they do not need to be specified when the function is called.

%output:encoding("charset")

Identifies the text encoding of the input documents.

When this annotation is used with the %xmlf:split annotation or the $names argument, only ISO-8859-1, US-ASCII, and UTF-8 are valid encodings. Otherwise, the valid encodings are those supported by the JDK. UTF-8 is assumed when this annotation is omitted.

See Also:

"Supported Encodings" in the Oracle Java SE documentation at

http://docs.oracle.com/javase/7/docs/technotes/guides/intl/encoding.doc.html
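
For instance, a collection function that reads ISO-8859-1 input while splitting files in parallel might be declared as follows. This is a sketch: the function name local:records and the element name "record" are illustrative assumptions.

```xquery
(: ISO-8859-1 is one of the three encodings permitted
   when %xmlf:split (or the $names argument) is used :)
declare
   %xmlf:collection
   %xmlf:split("record")
   %output:encoding("ISO-8859-1")
   function local:records($uris as xs:string*) as element()* external;
```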

%xmlf:split-namespace("prefix", "namespace")

This annotation provides extra namespace declarations to the parser. You can specify it multiple times to declare one or more namespaces.

Use this annotation to declare the namespaces of ancestor elements. When XML is processed in parallel, only elements that match the specified names are processed by an XML parser. If a matching element depends on the namespace declaration of one of its ancestor elements, then the declaration is not visible to the parser and an error may occur.

These namespace declarations can also be used in element names when specifying the split names. For example:

declare 
    %xmlf:collection 
    %xmlf:split("eg:foo") 
    %xmlf:split-namespace("eg", "http://example.org")
    function local:myFunction($uris as xs:string*) as element()* external;

%xmlf:split-entity("entity-name", "entity-value")

Provides entity definitions to the XML parser. When XML is processed in parallel, only elements that match the specified split names are processed by an XML parser. The DTD of an input document that is split and processed in parallel is not processed.

In this example, the XML parser expands &foo; entity references as "Hello World":

%xmlf:split-entity("foo","Hello World")

%xmlf:split-max("split-size")

Specifies the maximum split size as either an integer or a string value. The split size controls how the input file is divided into tasks. Hadoop calculates the split size as max($split-min, min($split-max, $block-size)). Optional.

In a string value, you can append K, k, M, m, G, or g to the value to indicate kilobytes, megabytes, or gigabytes instead of bytes (the default unit). These qualifiers are not case sensitive. The following examples are equivalent:

%xmlf:split-max(1024)
%xmlf:split-max("1024")
%xmlf:split-max("1K")

%xmlf:split-min("split-size")

Specifies the minimum split size as either an integer or a string value. The split size controls how the input file is divided into tasks. Hadoop calculates the split size as max($split-min, min($split-max, $block-size)). Optional.

In a string value, you can append K, k, M, m, G, or g to the value to indicate kilobytes, megabytes, or gigabytes instead of bytes (the default unit). These qualifiers are not case sensitive. The following examples are equivalent:

%xmlf:split-min(1024)
%xmlf:split-min("1024")
%xmlf:split-min("1K")
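
As a worked example of the split-size formula (the sizes here are assumptions): with an HDFS block size of 128 MB, %xmlf:split-min("1M"), and %xmlf:split-max("32M"), Hadoop evaluates max($split-min, min($split-max, $block-size)):

```xquery
(: max(1 MB, min(32 MB, 128 MB)) expressed in bytes :)
fn:max((1048576, fn:min((33554432, 134217728))))
(: yields 33554432, that is, 32 MB splits :)
```

In other words, split-max caps the split below the block size, while split-min sets a floor so tasks do not become too small.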

Notes

Restrictions on Splitting XML Files

Individual XML documents can be processed in parallel when the element names are specified using either the $names argument or the %xmlf:split annotation.

The input documents must meet the following constraints in order to be processed in parallel:

  • The XML cannot contain a comment, CDATA section, or processing instruction whose text matches one of the specified element names (that is, a < character followed by a name that expands to a QName). Otherwise, such content might be parsed incorrectly as an element.

  • An element in the file that matches a specified element name cannot contain a descendant element that also matches a specified name. Otherwise, multiple processors might pick up the matching descendant and cause the function to produce incorrect results.

  • An element that matches one of the specified element names (and all of its descendants) must not depend on the namespace declarations of any of its ancestors. Because the ancestors of a matching element are not parsed, the namespace declarations in these elements are not processed.

    You can work around this limitation by manually specifying the namespace declarations with the %xmlf:split-namespace annotation.
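
For example, assuming a hypothetical input file in which comment elements nest, the following document violates the second constraint when "comment" is a specified element name, because a matching element contains a matching descendant:

```xml
<comments>
   <comment id="1" user="john" text="Outer comment">
      <!-- Invalid as a split target: this descendant also matches "comment" -->
      <comment id="2" user="kelly" text="Nested reply"/>
   </comment>
</comments>
```

For input like this, either restructure the data (for example, use a distinct element name such as reply for nested items) or fall back to the single-argument form of xmlf:collection, which parses each file in one task.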

Oracle recommends that the specified element names do not match elements in the file that are bigger than the split size. If they do, then the adapter functions correctly but not efficiently.

Processing XML in parallel is difficult, because parsing cannot begin in the middle of an XML file. XML constructs like CDATA sections, comments, and namespace declarations impose this limitation. A parser starting in the middle of an XML document cannot assume that, for example, the string <foo> is a begin element tag, without searching backwards to the beginning of the document to ensure that it is not in a CDATA section or a comment.

However, large XML documents typically contain sequences of similarly structured elements and thus are amenable to parallel processing. If you specify the element names, then each task works by scanning a portion of the document for elements that match one of the specified names. Only elements that match a specified name are given to a true XML parser. Thus, the parallel processor does not perform a true parse of the entire document.

Example

Example 1   Querying XML Files

The following example declares a custom function to access XML files:

import module "oxh:text";
 
declare 
   %xmlf:collection 
   %xmlf:split("comment")
   %xmlf:split-max("32M")
function local:comments($uris as xs:string*) as element()* external;
 
for $c in local:comments("mydata/comment*.xml")
where $c/@user eq "mike"
return text:put($c/@id)
 

The query creates a text file that contains the following lines:

54321
87654