6 Oracle XQuery for Hadoop Reference

This chapter describes the adapters available in Oracle XQuery for Hadoop, as well as several other library modules.

6.1 Avro File Adapter

The Avro file adapter provides functions to read and write Avro container files in HDFS. It is described in the following topics:

Note:

Additional Configuration Steps for HDP Users

Oracle XQuery for Hadoop has been verified to run on both Cloudera's CDH5 and Hortonworks' HDP 2.3.3. However, to run queries that write to Avro container files in HDP 2.3.2, you must change the OXH classpath definition to use avro-mapred-1.7.4-hadoop2.jar.
  1. Download the JAR from the Apache archive at https://archive.apache.org/dist/avro/avro-1.7.4/java/

  2. In $OXH_HOME/lib/oxh-lib.xml locate the following path tag:

    <path id="oracle.hadoop.xquery.avro.lib">
        <fileset dir="${oracle.hadoop.loader.olh_home}/jlib" erroronmissingdir="false">
            <include name="avro-mapred*.jar"/>
        </fileset>
    </path>
    

    Replace the path tag above with the following revision. [DIRECTORY] in this example is a placeholder. Replace it with the directory path to the JAR.

    <path id="oracle.hadoop.xquery.avro.lib">
        <fileset dir="[DIRECTORY]">
            <include name="avro-mapred-1.7.4-hadoop2.jar"/>\
        </fileset>
    </path>
    

6.1.1 Built-in Functions for Reading Avro Files

To use the built-in functions in your query, you must import the Avro file module as follows:

import module "oxh:avro";

The Avro file module contains the following functions:

There are no built-in functions for writing Avro container files. To write Avro files, you must use a custom function that specifies the Avro writer schema.

6.1.1.1 avro:collection-avroxml

Accesses a collection of Avro files in HDFS. The files might be split up and processed in parallel by multiple tasks. The function returns an XML element for each object. See "About Converting Values Between Avro and XML."

Signature

declare %avro:collection("avroxml") function 
   avro:collection-avroxml($uris as xs:string*) as element()* external;

Parameters

$uris: The Avro file URIs

Returns

One XML element for each Avro object.
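
For example, the following query is a minimal sketch that writes the full_name field of each record to a text file. The mydata/persons*.avro path is hypothetical, and the records are assumed to use the Person schema shown in "Reading Records":

import module "oxh:avro";
import module "oxh:text";

(: hypothetical input path; Person records assumed :)
for $rec in avro:collection-avroxml("mydata/persons*.avro")
return
   text:put($rec/full_name)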

6.1.1.2 avro:get

Retrieves an entry from an Avro map modeled as XML.

If you omit the $map parameter, then the behavior is identical to calling the two-argument function and using the context item for $map.

Signature

avro:get($key as xs:string?, $map as node()?) as element(oxh:entry)?
 
avro:get($key as xs:string?) as element(oxh:entry)?

Returns

The value of this XPath expression:

$map/oxh:entry[@key eq $key]

Example

These function calls are equivalent:

$var/avro:get("key")

avro:get("key", $var)
 
$var/oxh:entry[@key eq "key"]

In these examples, $var is an Avro map modeled as XML. See "Reading Maps."

6.1.2 Custom Functions for Reading Avro Container Files

You can use the following annotations to define functions that read collections of Avro container files in HDFS. These annotations provide additional functionality that is not available using the built-in functions.

Signature

Custom functions for reading Avro files must have the following signature:

declare %avro:collection("avroxml") [additional annotations]
   function local:myFunctionName($uris as xs:string*) as element()* external;

Annotations

%avro:collection("avroxml")

Declares the avroxml collection function. Required.

A collection function accesses Avro files in HDFS. The files might be split up and processed in parallel by multiple tasks. The function returns an XML element for each object. See "About Converting Values Between Avro and XML."

%avro:schema("avro-schema")

Provides the Avro reader schema as the value of the annotation. Optional.

The objects in the file are mapped to the reader schema when it is specified. For example:

%avro:schema('
   {
      "type": "record",
      "name": "Person",
      "fields" : [
         {"name": "full_name", "type": "string"},
         {"name": "age", "type": ["int", "null"] }
       ]
   }
')

You cannot combine this annotation with %avro:schema-file or %avro:schema-kv.

See Also:

"Schema Resolution" in the Apache Avro Specification at

http://avro.apache.org/docs/current/spec.html#Schema+Resolution

%avro:schema-file("avro-schema-uri")

Like %avro:schema, but the annotation value is a file URI that contains the Avro reader schema. Relative URIs are resolved against the current working directory of the client's local file system. Optional.

For example, %avro:schema-file("schemas/person.avsc").

You cannot combine this annotation with %avro:schema or %avro:schema-kv.

%avro:schema-kv("schema-name")

Like %avro:schema, but the annotation value is a fully qualified record name. The record schema is retrieved from the Oracle NoSQL Database catalog. Optional.

For example, %avro:schema-kv("org.example.PersonRecord").

You must specify the connection parameters to Oracle NoSQL Database when you use this annotation. See "Oracle NoSQL Database Adapter Configuration Properties."

You cannot combine this annotation with %avro:schema or %avro:schema-file.

%avro:split-max("split-size")

Specifies the maximum split size as either an integer or a string value. The split size controls how the input file is divided into tasks. Hadoop calculates the split size as max($split-min, min($split-max, $block-size)). Optional.

In a string value, you can append K, k, M, m, G, or g to the value to indicate kilobytes, megabytes, or gigabytes instead of bytes (the default unit). These qualifiers are not case sensitive. The following examples are equivalent:

%avro:split-max(1024)
%avro:split-max("1024")
%avro:split-max("1K")
%avro:split-min("split-size")

Specifies the minimum split size as either an integer or a string value. The split size controls how the input file is divided into tasks. Hadoop calculates the split size as max($split-min, min($split-max, $block-size)). Optional.

In a string value, you can append K, k, M, m, G, or g to the value to indicate kilobytes, megabytes, or gigabytes instead of bytes (the default unit). These qualifiers are not case sensitive. The following examples are equivalent:

%avro:split-min(1024)
%avro:split-min("1024")
%avro:split-min("1K")

6.1.3 Custom Functions for Writing Avro Files

You can use the following annotations to define functions that write Avro files.

Signature

Custom functions for writing Avro files must have the following signature:

declare %avro:put("avroxml") [additional annotations]
   function local:myFunctionName($value as item()) external;

Annotations

%avro:put("avroxml")

Declares the avroxml put function. Required.

An Avro schema must be specified using one of the following annotations:

  • %avro:schema

  • %avro:schema-file

  • %avro:schema-kv

The input XML value is converted to an instance of the schema. See "Writing XML as Avro."

%avro:schema("avro-schema")

Specifies the schema of the files. For example:

%avro:schema('
   {
      "type": "record",
      "name": "Person",
      "fields" : [
         {"name": "full_name", "type": "string"},
         {"name": "age", "type": ["int", "null"] }
      ]
   }
')

You cannot combine this annotation with %avro:schema-file or %avro:schema-kv.

%avro:schema-file("avro-schema-uri")

Like %avro:schema, but the annotation value is a file URI that contains the Avro writer schema. Relative URIs are resolved against the current working directory of the client's local file system.

For example: %avro:schema-file("schemas/person.avsc")

You cannot combine this annotation with %avro:schema or %avro:schema-kv.

%avro:schema-kv("schema-name")

Like %avro:schema, but the annotation value is a fully qualified record name. The record schema is retrieved from the Oracle NoSQL Database catalog.

For example: %avro:schema-kv("org.example.PersonRecord")

You must specify the connection parameters to Oracle NoSQL Database when you use this annotation. See "Oracle NoSQL Database Adapter Configuration Properties."

You cannot combine this annotation with %avro:schema or %avro:schema-file.

%avro:compress("method", [level]?)

Specifies the compression format used on the output.

The codec is one of the following string literal values:

  • deflate: The level controls the trade-off between speed and compression. Valid values are 1 to 9, where 1 is the fastest and 9 is the most compressed.

  • snappy: This algorithm is designed for high speed and moderate compression.

The default is no compression.

The level is an integer value. It is optional and only supported when codec is deflate.

For example:

%avro:compress("snappy")
%avro:compress("deflate")
%avro:compress("deflate", 3)
%avro:file("name")

Specifies the output file name prefix. The default prefix is part.
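
For example, the following annotation uses a hypothetical prefix of person instead of the default:

%avro:file("person")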

6.1.4 Examples of Avro File Adapter Functions

These examples use the following text file in HDFS:

mydata/ages.txt

john,45
kelly,36
laura,
mike,27

Example 6-1 Converting a Text File to Avro

The following query converts the file into compressed Avro container files:

import module "oxh:text";
 
declare 
   %avro:put("avroxml")
   %avro:compress("snappy")
   %avro:schema('
      {
         "type": "record",
         "name": "AgeRec",
         "fields" : [
            {"name": "user", "type": "string"},
            {"name": "age", "type": ["int", "null"] }
         ]
      }
   ')
function local:put($arg as item()) external;

for $line in text:collection("mydata/ages.txt")
let $split := fn:tokenize($line, ",")
return
   local:put(
      <rec>
         <user>{$split[1]}</user>
         {
            if ($split[2] castable as xs:int) then
               <age>{$split[2]}</age>
            else 
               ()
         }
      </rec>
   )

The query generates an Avro file with the following records, represented here as JSON:

{"user":"john","age":{"int":45}}
{"user":"kelly","age":{"int":36}}
{"user":"laura","age":null}
{"user":"mike","age":{"int":27}}

Example 6-2 Querying Records in Avro Container Files

The next query selects records in which the age is either null or greater than 30 from the myoutput directory. The query in Example 6-1 generated these records.

import module "oxh:text";
import module "oxh:avro";

for $rec in avro:collection-avroxml("myoutput/part*.avro")
where $rec/age/nilled() or $rec/age gt 30
return
   text:put($rec/user)

This query creates files that contain the following lines:

john
kelly
laura

6.1.5 About Converting Values Between Avro and XML

This section describes how Oracle XQuery for Hadoop converts data between Avro and XML:

6.1.5.1 Reading Avro as XML

Both the Avro file adapter and the Oracle NoSQL Database adapter have an avroxml method, which you can use with the collection functions to read Avro records as XML. After the Avro is converted to XML, you can query and transform the data using XQuery.

The following topics describe how Oracle XQuery for Hadoop reads Avro:

6.1.5.1.1 Reading Records

An Avro record is converted to an <oxh:item> element with one child element for each field in the record.

For example, consider the following Avro schema:

{
   "type": "record",
   "name": "Person",
   "fields" : [
      {"name": "full_name", "type": "string"},
      {"name": "age", "type": ["int", "null"] }
   ]
}

This is an instance of the record modeled as XML:

<oxh:item>
   <full_name>John Doe</full_name>
   <age>46</age>
</oxh:item>

Converting Avro records to XML enables XQuery to query them. The next example queries an Avro container file named person.avro, which contains Person records. The query converts the records to a CSV text file in which each line contains the full_name and age values:

import module "oxh:avro";
import module "oxh:text";
 
for $x in avro:collection-avroxml("person.avro")
return
   text:put($x/full_name || "," || $x/age)
 

Null values are converted to nilled elements. A nilled element has an xsi:nil attribute set to true; it is always empty. You can use the XQuery fn:nilled function to test if a record field is null. For example, the following query writes the name of Person records that have a null value for age:

import module "oxh:avro";
import module "oxh:text";
 
for $x in avro:collection-avroxml("person.avro")
where $x/age/nilled()
return
   text:put($x/full_name)
 

For nested records, the fields of the inner schema become child elements of the element that corresponds to the field in the outer schema. For example, this schema has a nested record:

{
   "type": "record", 
   "name": "PersonAddress",
   "fields" : [
      {"name": "full_name", "type": "string"},
      {"name": "address", "type": 
         { "type" : "record",
           "name" : "Address",
           "fields" : [
              { "name" : "street", "type" : "string" },
              { "name" : "city", "type" : "string" }
            ]
         }
      }
   ]
}

This is an instance of the record as XML:

<oxh:item>
   <full_name>John Doe</full_name>
   <address>
      <street>123 First St.</street>
      <city>New York</city>
   </address>
</oxh:item>

The following example queries an Avro container file named person-address.avro that contains PersonAddress records, and writes the names of the people that live in New York to a text file:

import module "oxh:avro";
import module "oxh:text";

for $person in avro:collection-avroxml("examples/person-address.avro")
where $person/address/city eq "New York"
return
   text:put($person/full_name)

6.1.5.1.2 Reading Maps

Avro map values are converted to an element that contains one child <oxh:entry> element for each entry in the map. For example, consider the following schema:

{
   "type": "record", 
   "name": "PersonProperties",
   "fields" : [
      {"name": "full_name", "type": "string"},
      {"name": "properties", "type": 
         {"type": "map", "values": "string"}
      }
   ]
}

This is an instance of the schema as XML:

<oxh:item>
   <full_name>John Doe</full_name>
   <properties>
      <oxh:entry key="employer">Example Inc</oxh:entry>
      <oxh:entry key="hair color">brown</oxh:entry>
      <oxh:entry key="favorite author">George RR Martin</oxh:entry>
   </properties>
</oxh:item>

The following example queries a file named person-properties.avro that contains PersonProperties records, and writes the names of the people employed by Example Inc. The query shows how regular XPath expressions can retrieve map entries. Alternatively, you can use the avro:get function as a shortcut to retrieve map entries.

import module "oxh:avro";
import module "oxh:text";
 
for $person in avro:collection-avroxml("person-properties.avro")
where $person/properties/oxh:entry[@key eq "employer"] eq "Example Inc"
return
   text:put($person/full_name)
 

The following query uses the avro:get function to retrieve the employer entry. It is equivalent to the previous query.

import module "oxh:avro";
import module "oxh:text";
 
for $person in avro:collection-avroxml("person-properties.avro")
where $person/properties/avro:get("employer") eq "Example Inc"
return
   text:put($person/full_name)

You can use the XQuery fn:nilled function to test for null values. This example returns true if the map entry is null:

$var/avro:get("key")/nilled()

6.1.5.1.3 Reading Arrays

Oracle XQuery for Hadoop converts Avro array values to an element that contains a child <oxh:item> element for each item in the array. For example, consider the following schema:

{
   "type": "record", 
   "name": "PersonScores",
   "fields" : [
      {"name": "full_name", "type": "string"},
      {"name": "scores", "type": 
         {"type": "array", "items": "int"}
      }
   ]
}

This is an instance of the schema as XML:

<oxh:item>
   <full_name>John Doe</full_name>
   <scores>
      <oxh:item>128</oxh:item>
      <oxh:item>151</oxh:item>
      <oxh:item>110</oxh:item>
   </scores>
</oxh:item>

The following example queries a file named person-scores.avro that contains PersonScores records, and writes the sum and count of scores for each person:

import module "oxh:avro";
import module "oxh:text";
 
for $person in avro:collection-avroxml("person-scores.avro")
let $scores := $person/scores/*
return
   text:put($person/full_name || "," || sum($scores) || "," || count($scores))

You can access a specific element of an array by using a numeric XPath predicate. For example, this path expression selects the second score. XPath indexing starts at 1 (not 0).

$person/scores/oxh:item[2]

6.1.5.1.4 Reading Unions

Oracle XQuery for Hadoop converts an instance of an Avro union type based on the actual member type of the value. The name of the member type is added as an XML avro:type attribute to the enclosing element, which ensures that queries can distinguish between instances of different member types. However, the attribute is not added for trivial unions where there are only two member types and one of them is null.

For example, consider the following union of two records:

[
   {
      "type": "record", 
      "name": "Person1",
      "fields" : [
         {"name": "full_name", "type": "string"}
      ]
   }
   ,
   {
      "type": "record", 
      "name": "Person2",
      "fields" : [
         {"name": "fname", "type": "string"}
      ]
   }
]
 

This is an instance of the schema as XML:

<oxh:item avro:type="Person2">
   <fname>John Doe</fname>
</oxh:item>

The following example queries a file named person-union.avro that contains instances of the previous union schema, and writes the names of the people from both record types to a text file:

import module "oxh:avro";
import module "oxh:text";

for $person in avro:collection-avroxml("examples/person-union.avro")
return
   if ($person/@avro:type eq "Person1") then
      text:put($person/full_name)
   else if ($person/@avro:type eq "Person2") then
      text:put($person/fname)
   else
      error(xs:QName("UNEXPECTED"), "Unexpected record type:" || $person/@avro:type)

6.1.5.1.5 Reading Primitives

The following table shows how Oracle XQuery for Hadoop maps Avro primitive types to XQuery atomic types.

Table 6-1 Mapping Avro Primitive Types to XQuery Atomic Types

Avro        XQuery
boolean     xs:boolean
int         xs:int
long        xs:long
float       xs:float
double      xs:double
bytes       xs:hexBinary
string      xs:string

Avro null values are mapped to empty nilled elements. To distinguish between a null string value and an empty string value, use the XQuery nilled function. This path expression only returns true if the field value is null:

$record/field/fn:nilled()

Avro fixed values are mapped to xs:hexBinary, and enums are mapped to xs:string.

6.1.5.2 Writing XML as Avro

Both the Avro file adapter and the Oracle NoSQL Database adapter have an avroxml method, which you can use with the put functions to write XML as Avro. The following topics describe how the XML is converted to an Avro instance:

6.1.5.2.1 Writing Records

Oracle XQuery for Hadoop maps the XML to an Avro record schema by matching the child element names to the field names of the record. For example, consider the following Avro schema:

{
   "type": "record",
   "name": "Person",
   "fields" : [
      {"name": "full_name", "type": "string"},
      {"name": "age", "type": ["int", "null"] }
   ]
}

You can use the following XML element to write an instance of this record in which the full_name field is John Doe and the age field is 46. The name of the root element (Person) is inconsequential. Only the names of the child elements are used to map to the Avro record fields (full_name and age).

<person>
   <full_name>John Doe</full_name>
   <age>46</age>
</person>

The next example uses the following CSV file named people.csv:

John Doe,46
Jane Doe,37
     .
     .
     .
 

This query converts values from the CSV file to Avro Person records:

import module "oxh:avro";
import module "oxh:text";
 
declare 
   %avro:put("avroxml")
   %avro:schema('
      {
         "type": "record", 
         "name": "Person",
         "fields" : [
            {"name": "full_name", "type": "string"},
            {"name": "age", "type": ["int", "null"] }
         ]
      }
   ')
function local:put-person($person as element()) external;

for $line in text:collection("people.csv")
let $split := tokenize($line, ",")
return
   local:put-person(
      <person>
         <full_name>{$split[1]}</full_name>
         <age>{$split[2]}</age>
      </person>
   )
 

For null values, you can omit the element or set the xsi:nil="true" attribute. For example, this modified query sets age to null when the value is not numeric:

   .
   .
   .
for $line in text:collection("people.csv")
let $split := tokenize($line, ",")
return
   local:put-person(
      <person>
         <full_name>{$split[1]}</full_name>
         {
         if ($split[2] castable as xs:int) then
            <age>{$split[2]}</age>
         else 
            ()
         }
      </person>
   )

In the case of nested records, the values are obtained from nested elements. The next example uses the following schema:

{
"type": "record", 
"name": "PersonAddress",
"fields" : [
   {"name": "full_name", "type": "string"},
   {"name": "address", "type": 
      { "type" : "record",
        "name" : "Address",
        "fields" : [
           { "name" : "street", "type" : "string" },
           { "name" : "city", "type" : "string" }
        ]
      }
   }
]
}

You can use the following XML to write an instance of this record:

<person>
   <full_name>John Doe</full_name>
   <address>
      <street>123 First St.</street>
      <city>New York</city>
   </address>
</person>

6.1.5.2.2 Writing Maps

Oracle XQuery for Hadoop converts XML to an Avro map with one map entry for each <oxh:entry> child element. For example, consider the following schema:

{
   "type": "record", 
   "name": "PersonProperties",
   "fields" : [
      {"name": "full_name", "type": "string"},
      {"name": "properties", "type": 
         {"type": "map", "values": "string"}
      }
   ]
}

You can use the following XML element to write an instance of this schema in which the full_name field is John Doe, and the properties field is set to a map with three entries:

<person>
   <full_name>John Doe</full_name>
   <properties>
      <oxh:entry key="hair color">brown</oxh:entry>
      <oxh:entry key="favorite author">George RR Martin</oxh:entry>
      <oxh:entry key="employer">Example Inc</oxh:entry>
   </properties>
</person>

6.1.5.2.3 Writing Arrays

Oracle XQuery for Hadoop converts XML to an Avro array with one item for each <oxh:item> child element. For example, consider the following schema:

{
   "type": "record", 
   "name": "PersonScores",
   "fields" : [
      {"name": "full_name", "type": "string"},
      {"name": "scores", "type": 
         {"type": "array", "items": "int"}
      }
   ]
}

You can use the following XML element to write an instance of this schema in which the full_name field is John Doe and the scores field is set to [128, 151, 110]:

<person>
   <full_name>John Doe</full_name>
   <scores>
      <oxh:item>128</oxh:item>
      <oxh:item>151</oxh:item>
      <oxh:item>110</oxh:item>
   </scores>
</person>

6.1.5.2.4 Writing Unions

When writing an Avro union type, Oracle XQuery for Hadoop bases the selection of a member type on the value of the avro:type attribute.

This example uses the following schema:

[
   {
      "type": "record", 
      "name": "Person1",
      "fields" : [
         {"name": "full_name", "type": "string"}
      ]
   }
   ,
   {
      "type": "record", 
      "name": "Person2",
      "fields" : [
         {"name": "fname", "type": "string"}
      ]
   }
]

The following XML is mapped to an instance of the Person1 record:

<person avro:type="Person1">
   <full_name>John Doe</full_name>
</person>

This XML is mapped to an instance of the Person2 record:

<person avro:type="Person2">
   <fname>John Doe</fname>
</person>

The avro:type attribute selects the member type of the union. For trivial unions that contain a null and one other type, the avro:type attribute is unnecessary. If the member type cannot be determined, then an error is raised.

6.1.5.2.5 Writing Primitives

To map primitive values, Oracle XQuery for Hadoop uses the equivalent data types shown in Table 6-1 to cast an XML value to the corresponding Avro type. If the value cannot be converted to the Avro type, then an error is raised.

This example uses the following schema:

{
   "type": "record",
   "name": "Person",
   "fields" : [
      {"name": "full_name", "type": "string"},
      {"name": "age", "type": ["int", "null"] }
   ]
}

Attempting to map the following XML to an instance of this schema raises an error, because the string value apple cannot be converted to an int:

<person>
   <full_name>John Doe</full_name>
   <age>apple</age>
</person>

6.2 JSON File Adapter

The JSON file adapter provides access to JSON files stored in HDFS. It also contains functions for working with JSON data embedded in other file formats. For example, you can query JSON that is stored as lines in a large text file by using json:parse-as-xml with the text:collection function.

Processing a single JSON file in parallel is not currently supported. A set of JSON files can be processed in parallel, with sequential processing of each file.

The JSON module is described in the following topics:

6.2.1 Built-in Functions for Reading JSON

To use the built-in functions in your query, you must import the JSON file adapter as follows:

import module "oxh:json";

The JSON module contains the following functions:

As of Big Data Connectors Release 4.9, Oracle XQuery for Hadoop also supports XQuery 3.1, including the standard facilities for processing JSON: fn:parse-json, fn:json-to-xml, and fn:xml-to-json.

6.2.1.1 json:collection-jsonxml

Accesses a collection of JSON files in HDFS. Multiple files can be processed concurrently, but each individual file is parsed by a single task.

The JSON file adapter automatically decompresses files compressed with a Hadoop-supported compression codec.

Signature

json:collection-jsonxml($uris as xs:string*) as element()* external;

Parameters

$uris: The JSON file URIs

Returns

XML elements that model the JSON values. See "About Converting JSON Data Formats to XML."

6.2.1.2 json:parse-as-xml

Parses a JSON value as XML.

Signature

json:parse-as-xml($arg as xs:string?) as element(*)?

Parameters

$arg: Can be the empty sequence.

Returns

An XML element that models the JSON value. An empty sequence if $arg is an empty sequence. See "About Converting JSON Data Formats to XML."
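
For example, the following call parses a hypothetical JSON object literal and returns it in the XML form described in "About Converting JSON Data Formats to XML":

json:parse-as-xml('{ "user" : "john", "age" : 45 }')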

6.2.1.3 json:get

Retrieves an entry from a JSON object modeled as XML. See "About Converting JSON Data Formats to XML."

Signature

json:get($key as xs:string?, $obj as node()?) as element(oxh:entry)?

json:get($key as xs:string?) as element(oxh:entry)?

Parameters

$key: The JSON data key

$obj: The JSON object value

Returns

The value of the following XPath expression:

$obj/oxh:entry[@key eq $key]

If $obj is not present, then the behavior is identical to calling the two-argument function using the context item for $obj. See the Notes.

Notes

These function calls are equivalent:

$var/json:get("key")

json:get("key", $var)

$var/oxh:entry[@key eq "key"]
 

In these examples, $var is a JSON object modeled as XML. See "Reading Maps."

6.2.2 Custom Functions for Reading JSON Files

You can use the following annotations to define functions that read collections of JSON files in HDFS. These annotations provide additional functionality that is not available using the built-in functions.

Signature

Custom functions for reading JSON files must have the following signature:

declare %json:collection("jsonxml") [additional annotations]
   function local:myFunctionName($uris as xs:string*) as element()* external;

Annotations

%json:collection("jsonxml")

Declares the collection function. The annotation parameter must be jsonxml.

%output:encoding("charset")

Identifies the text encoding of the input files.

The valid encodings are those supported by the JVM. If this annotation is omitted, then the encoding is automatically detected from the JSON file as UTF-8, UTF-16 big-endian serialization (BE) or little-endian serialization (LE), or UTF-32 (BE or LE).

For better performance, omit the encoding annotation if the actual file encoding matches one of the encodings described in RFC 4627, Section 3 "Encoding," on the Internet Engineering Task Force (IETF) website at

http://www.ietf.org/rfc/rfc4627.txt

Parameters

$uris as xs:string*

Lists the JSON file URIs. Required.

Returns

A collection of XML elements. Each element models the corresponding JSON value. See "About Converting JSON Data Formats to XML."
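
For example, the following declaration is a sketch of a custom JSON collection function; the ISO-8859-1 encoding is a hypothetical value:

declare
   %json:collection("jsonxml")
   %output:encoding("ISO-8859-1")  (: hypothetical input encoding :)
function local:read-json($uris as xs:string*) as element()* external;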

6.2.3 Examples of JSON Functions

Example 6-3

This example uses the following JSON text files stored in HDFS:

mydata/users1.json
[
{ "user" : "john", "full name" : "John Doe", "age" : 45 },
{ "user" : "kelly", "full name" : "Kelly Johnson", "age" : 32 }
]
 
mydata/users2.json 
[
{ "user" : "laura", "full name" : "Laura Smith", "age" : null },
{ "user" : "phil", "full name" : "Phil Johnson", "age" : 27 }
]
 

The following query selects names of users whose last name is Johnson from users1.json and users2.json:

import module "oxh:text";
import module "oxh:json";
 
for $user in json:collection-jsonxml("mydata/users*.json")/oxh:item
let $fullname := $user/json:get("full name")
where tokenize($fullname, "\s+")[2] eq "Johnson"
return 
   text:put($fullname)

This query generates text files that contain the following lines:

Phil Johnson
Kelly Johnson

The remaining examples query the following text file in HDFS:

mydata/users-json.txt
 
{ "user" : "john", "full name" : "John Doe", "age" : 45 }
{ "user" : "kelly", "full name" : "Kelly Johnson", "age" : 32 }
{ "user" : "laura", "full name" : "Laura Smith", "age" : null }
{ "user" : "phil", "full name" : "Phil Johnson", "age" : 27 }

Example 6-4

The following query selects the names of users that are older than 30 from users-json.txt:

import module "oxh:text";
import module "oxh:json";

for $line in text:collection("mydata/users-json.txt")
let $user := json:parse-as-xml($line)
where $user/json:get("age") gt 30
return 
   text:put($user/json:get("full name"))

This query generates text files that contain the following lines:

John Doe
Kelly Johnson

Example 6-5

The next query selects the names of employees that have a null value for age from users-json.txt:

import module "oxh:text";
import module "oxh:json";

for $line in text:collection("mydata/users-json.txt")
let $user := json:parse-as-xml($line)
where $user/json:get("age")/nilled()
return 
   text:put($user/json:get("full name"))
 

This query generates a text file that contains the following line:

Laura Smith

6.2.4 JSON File Adapter Configuration Properties

Oracle XQuery for Hadoop uses the generic options for specifying configuration properties in the hadoop command. You can use the -conf option to identify configuration files, and the -D option to specify individual properties.

The following configuration properties are equivalent to the Jackson parser options with the same names. You can enter the option name in either upper or lower case. For example, oracle.hadoop.xquery.json.parser.ALLOW_BACKSLASH_ESCAPING_ANY_CHARACTER and oracle.hadoop.xquery.json.parser.allow_backslash_escaping_any_character are equal.
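
For example, the following command uses the -D option to enable ALLOW_COMMENTS for a hypothetical query named myquery.xq, following the same hadoop invocation pattern used elsewhere in this chapter:

$ hadoop jar $OXH_HOME/lib/oxh.jar -D oracle.hadoop.xquery.json.parser.ALLOW_COMMENTS=true ./myquery.xq -output ./myoutput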

oracle.hadoop.xquery.json.parser.ALLOW_BACKSLASH_ESCAPING_ANY_CHARACTER

Type: Boolean

Default Value: false

Description: Enables any character to be escaped with a backslash (\). Otherwise, only the following characters can be escaped: quotation mark (\"), slash (\/), backslash (\\), backspace (\b), form feed (\f), new line (\n), carriage return (\r), horizontal tab (\t), and hexadecimal representations (\unnnn).

oracle.hadoop.xquery.json.parser.ALLOW_COMMENTS

Type: Boolean

Default Value: false

Description: Allows Java and C++ comments (/* and //) within the parsed text.

oracle.hadoop.xquery.json.parser.ALLOW_NON_NUMERIC_NUMBERS

Type: Boolean

Default Value: false

Description: Allows Not a Number (NaN) tokens to be parsed as floating-point number values.

oracle.hadoop.xquery.json.parser.ALLOW_NUMERIC_LEADING_ZEROS

Type: Boolean

Default Value: false

Description: Allows integral numbers to start with zeroes, such as 00001. The zeros do not change the value and can be ignored.

oracle.hadoop.xquery.json.parser.ALLOW_SINGLE_QUOTES

Type: Boolean

Default Value: false

Description: Allows single quotes (') to delimit string values.

oracle.hadoop.xquery.json.parser.ALLOW_UNQUOTED_CONTROL_CHARS

Type: Boolean

Default Value: false

Description: Allows JSON strings to contain unquoted control characters (that is, ASCII characters with a decimal value less than 32, including the tab and line feed).

oracle.hadoop.xquery.json.parser.ALLOW_UNQUOTED_FIELD_NAMES

Type: Boolean

Default Value: false

Description: Allows unquoted field names, which are allowed by JavaScript but not by the JSON specification.

6.2.5 About Converting JSON Data Formats to XML

This section describes how JSON data formats are converted to XML. It contains the following topics:

As of Big Data Connectors Release 4.9, Oracle XQuery for Hadoop also supports XQuery 3.1, including the standard facilities for processing JSON: fn:parse-json, fn:json-to-xml, and fn:xml-to-json.

6.2.5.1 About Converting JSON Objects to XML

JSON objects are similar to Avro maps and are converted to the same XML structure. See "Reading Maps."

For example, the following JSON object is converted to an XML element:

{ 
   "user" : "john", 
   "full_name" : "John Doe", 
   "age" : 45 
}

The object is modeled as the following element:

<oxh:item>
    <oxh:entry key="user">john</oxh:entry>
    <oxh:entry key="full_name">John Doe</oxh:entry>
    <oxh:entry key="age">45</oxh:entry>
</oxh:item>

6.2.5.2 About Converting JSON Arrays to XML

JSON arrays are similar to Avro arrays and are converted to the same XML structure. See "Reading Arrays."

For example, the following JSON array is converted to an XML element:

[ "red", "blue", "green" ]

The array is modeled as the following element:

<oxh:item>
   <oxh:item>red</oxh:item>
   <oxh:item>blue</oxh:item>
   <oxh:item>green</oxh:item>
</oxh:item>

6.2.5.3 About Converting Other JSON Types

The other JSON values are mapped as shown in the following table.

Table 6-2 JSON Type Conversions

JSON          XML
null          An empty (nilled) element
true/false    xs:boolean
number        xs:decimal
string        xs:string

6.3 Oracle Database Adapter

The Oracle Database adapter provides custom functions for loading data into tables in Oracle Database.

A custom put function supported by this adapter automatically calls Oracle Loader for Hadoop at run time, either to load the data immediately or to output it to HDFS. You can declare and use multiple custom Oracle Database adapter put functions within a single query. For example, you might load data into different tables or into different Oracle databases with a single query.

Ensure that Oracle Loader for Hadoop is installed on your system, and that the OLH_HOME environment variable is set to the installation directory. See Step 3 of "Installing Oracle XQuery for Hadoop." Although not required, you might find it helpful to familiarize yourself with Oracle Loader for Hadoop before using this adapter.

The Oracle Database adapter is described in the following topics:

See Also:

"Software Requirements" for the versions of Oracle Database that Oracle Loader for Hadoop supports

6.3.1 Custom Functions for Writing to Oracle Database

You can use the following annotations to define functions that write to tables in an Oracle database either directly or by generating binary or text files for subsequent loading with another utility, such as SQL*Loader.

Signature

Custom functions for writing to Oracle database tables must have the following signature:

declare %oracle:put(["jdbc" | "oci" | "text" | "datapump"])
   [%oracle:columns(col1 [, col2...])] [%oracle-property annotations]
   function local:myPut($column1 [as xs:allowed_type_name[?]], [$column2 [as xs:allowed_type_name[?]], ...]) external;

Annotations

%oracle:put("output_mode"?)

Declares the put function and the output mode. Required.

The optional output_mode parameter can be one of the string literal values shown in the signature: jdbc, oci, text, or datapump.

For Oracle XQuery for Hadoop to write directly to an Oracle database table using either JDBC or OCI, all systems involved in processing the query must be able to connect to the Oracle Database system. See "About the Modes of Operation."

%oracle:columns(col1 [, col2...])

Identifies a selection of one or more column names in the target table. The order of column names corresponds to the order of the function parameters. See "Parameters." Optional.

This annotation enables loading a subset of the table columns. If omitted, the put function attempts to load all columns of the target table.

%oracle-property:property_name (value)

Controls various aspects of connecting to the database and writing data. You can specify multiple %oracle-property annotations. These annotations correspond to the Oracle Loader for Hadoop configuration properties. Every %oracle-property annotation has an equivalent Oracle Loader for Hadoop configuration property. "Oracle Loader for Hadoop Configuration Properties and Corresponding %oracle-property Annotations" explains this relationship in detail.

The %oracle-property annotations are optional. However, the various loading scenarios require you to specify some of them or their equivalent configuration properties. For example, to load data into an Oracle database using JDBC or OCI, you must specify the target table and the connection information.

The following example specifies a target table named VISITS, a user name of db, a password of password, and the URL connection string:

%oracle-property:targetTable('visits')
%oracle-property:connection.user('db')
%oracle-property:connection.password('password')
%oracle-property:connection.url('jdbc:oracle:thin:@//localhost:1521/orcl.example.com')

Parameters

$column1 [as xs:allowed_type_name[?]], [$column2 [as xs:allowed_type_name[?]],...]

Enter a parameter for each column in the same order as the Oracle table columns to load all columns, or use the %oracle:columns annotation to load selected columns.

Because the correlation between parameters and database columns is positional, the name of the parameter (column1 in the parameter syntax) is not required to match the name of the database column.

You can omit the explicit as xs:allowed_type_name type declaration for any parameter. For example, you can declare the parameter corresponding to a NUMBER column simply as $column1. In this case, the parameter is automatically assigned an XQuery type of item()*. At run time, the input value is cast to the allowed XQuery type for the corresponding table column type, as described in the following table. For example, data values that are mapped to a column with a NUMBER data type are automatically cast as xs:decimal. An error is raised if the cast fails.

Alternatively, you can specify the type or its subtype for any parameter. In this case, compile-time type checking is performed. For example, you can declare a parameter corresponding to a NUMBER column as $column as xs:decimal. You can also declare it as any subtype of xs:decimal, such as xs:integer.

You can include the ? optional occurrence indicator for each specified parameter type. This indicator allows the empty sequence to be passed as a parameter value at run time, so that a null is inserted into the database table. Any occurrence indicator other than ? raises a compile-time error.
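
For example, the following declaration is a sketch of typed parameters for the VISITS table used in the examples later in this section; the connection annotations are omitted for brevity. The ? indicator on $c2 allows a null to be inserted into the NAME column, and callers must then pass values of the declared types (for example, xs:dateTime($split[1])):

declare
   %oracle:put
   %oracle:columns('time', 'name')
   %oracle-property:targetTable('visits')
   (: connection annotations omitted for brevity; see Example 6-6 :)
function local:myPut($c1 as xs:dateTime, $c2 as xs:string?) external;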

The following table describes the appropriate mappings of XQuery data types with the supported Oracle Database data types. In addition to the listed XQuery data types, you can also use the subtypes, such as xs:integer instead of xs:decimal. Oracle data types are more restrictive than XQuery data types, and these restrictions are identified in the table.

Table 6-3 Data Type Mappings Between Oracle Database and XQuery

Database Type                   XQuery Type

VARCHAR2                        xs:string
   Limited by the VARCHAR2 maximum size of 4000 bytes.

CHAR                            xs:string
   Limited by the CHAR maximum size of 2000 bytes.

NVARCHAR2                       xs:string
   Limited by the NVARCHAR2 maximum size of 4000 bytes.

NCHAR                           xs:string
   Limited by the NCHAR maximum size of 2000 bytes.

DATE                            xs:dateTime
   Limited to the range of January 1, 4712 BC, to December 31, 9999 CE. If a time zone is specified in the xs:dateTime value, then the time zone information is dropped. Fractional seconds are also dropped. A time value of 24:00:00 is not valid.

TIMESTAMP                       xs:dateTime
   Limited to the range of January 1, 4712 BC, to December 31, 9999 CE. If a time zone is specified in the xs:dateTime value, then the time zone information is dropped. Fractional seconds are limited to a precision of 0 to 9 digits. A time value of 24:00:00 is not valid.

TIMESTAMP W LOCAL TIME ZONE     xs:dateTime
   Limited to the range of January 1, 4712 BC, to December 31, 9999 CE. In the offset from UTC, the time-zone hour field is limited to -12:00 to 14:00. Fractional seconds are limited to a precision of 0 to 9 digits. See "About Session Time Zones."

TIMESTAMP W TIME ZONE           xs:dateTime
   Limited to the range of January 1, 4712 BC, to December 31, 9999 CE. In the offset from UTC, the time-zone hour field is limited to -12:00 to 14:00. Fractional seconds are limited to a precision of 0 to 9 digits. See "About Session Time Zones."

INTERVAL DAY TO SECOND          xs:dayTimeDuration
   The day and fractional seconds are limited by a precision of 0 to 9 digits each. The hour is limited to a range of 0 to 23, and minutes and seconds are limited to a range of 0 to 59.

INTERVAL YEAR TO MONTH          xs:yearMonthDuration
   The year is limited by a precision of 0 to 9 digits, and the month is limited to a range of 0 to 11.

BINARY_FLOAT                    xs:float

BINARY_DOUBLE                   xs:double

NUMBER                          xs:decimal
   Limited by the NUMBER precision of 1 to 38 decimal digits and scale of -84 to 127 decimal digits.

FLOAT                           xs:decimal
   Limited by the FLOAT precision of 1 to 126 binary digits.

RAW                             xs:hexBinary
   Limited by the RAW maximum size of 2000 bytes.

About Session Time Zones

If an xs:dateTime value with no time zone is loaded into TIMESTAMP W TIME ZONE or TIMESTAMP W LOCAL TIME ZONE, then the time zone is set to the value of the sessionTimeZone parameter, which defaults to the JVM time zone. Using Oracle XQuery for Hadoop, you can set the sessionTimeZone property, as described in "Oracle Loader for Hadoop Configuration Properties and Corresponding %oracle-property Annotations."
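
For example, the following annotation sets the session time zone to a hypothetical value of UTC:

%oracle-property:connection.sessionTimeZone('UTC')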

Notes

With JDBC or OCI output modes, the Oracle Database adapter loads data directly into the database table. It also creates a directory with the same name as the custom put function, under the query output directory. For example, if your query output directory is myoutput, and your custom function is myPut, then the myoutput/myPut directory is created.

For every custom Oracle Database adapter put function, a separate directory is created. This directory contains output produced by the Oracle Loader for Hadoop job. When you use datapump or text output modes, the data files are written to this directory. The control and SQL scripts for loading the files are written to the _olh subdirectory, such as myoutput/myPut/_olh.

For descriptions of the generated files, see "Delimited Text Output Format" and "Oracle Data Pump Output Format."

6.3.2 Examples of Oracle Database Adapter Functions

These examples use the following text files in HDFS. The files contain a log of visits to different web pages. Each line represents a visit to a web page and contains the time, user name, and page visited:

mydata/visits1.log  
 
2013-10-28T06:00:00, john, index.html, 200
2013-10-28T08:30:02, kelly, index.html, 200
2013-10-28T08:32:50, kelly, about.html, 200
2013-10-30T10:00:10, mike, index.html, 401

mydata/visits2.log  

2013-10-30T10:00:01, john, index.html, 200
2013-10-30T10:05:20, john, about.html, 200
2013-11-01T08:00:08, laura, index.html, 200
2013-11-04T06:12:51, kelly, index.html, 200
2013-11-04T06:12:40, kelly, contact.html, 200

The examples also use the following file in HDFS, which contains anonymous page visits:

mydata/anonvisits.log
 
2011-10-30T10:01:01, index.html, 401
2011-11-04T06:15:40, contact.html, 401

This SQL command creates the VISITS table in the Oracle database:

CREATE TABLE visits (time TIMESTAMP, name VARCHAR2(15), page VARCHAR2(15), code NUMBER)

Example 6-6 Loading All Columns

The first query loads all information related to the page visit (time of visit, user name, page visited, and status code) to the VISITS table. For anonymous access, the user name is missing, therefore the query specifies () to insert a null into the table. The target table name, user name, password, and connection URL are specified with %oracle-property annotations.

The example uses a clear-text user name and password, which is insecure but acceptable in a development environment. Oracle recommends that you use a wallet instead for security, especially in a production application. You can configure an Oracle wallet using either Oracle Loader for Hadoop properties or their equivalent %oracle-property annotations. The specific properties that you must set are described in "Providing the Connection Details for Online Database Mode."

import module "oxh:text";
 
declare
   %oracle:put
   %oracle-property:targetTable('visits')
   %oracle-property:connection.user('db')
   %oracle-property:connection.password('password')
   %oracle-property:connection.url('jdbc:oracle:thin:@//localhost:1521/orcl.example.com')
function local:myPut($c1, $c2, $c3, $c4) external;
 
for $line in text:collection("mydata/*visits*.log")
let $split := fn:tokenize($line, "\s*,\s*")
return
   if (count($split) > 3) then
      local:myPut($split[1], $split[2], $split[3], $split[4])
   else
      local:myPut($split[1], (), $split[2], $split[3])
 

The VISITS table contains the following data after the query runs:

TIME                           NAME            PAGE                  CODE
------------------------------ --------------- --------------- ----------
30-OCT-13 10.00.01.000000 AM   john            index.html             200
30-OCT-13 10.05.20.000000 AM   john            about.html             200
01-NOV-13 08.00.08.000000 AM   laura           index.html             200
04-NOV-13 06.12.51.000000 AM   kelly           index.html             200
04-NOV-13 06.12.40.000000 AM   kelly           contact.html           200
28-OCT-13 06.00.00.000000 AM   john            index.html             200
28-OCT-13 08.30.02.000000 AM   kelly           index.html             200
28-OCT-13 08.32.50.000000 AM   kelly           about.html             200
30-OCT-13 10.00.10.000000 AM   mike            index.html             401
30-OCT-11 10.01.01.000000 AM                   index.html             401
04-NOV-11 06.15.40.000000 AM                   contact.html           401

Example 6-7 Loading Selected Columns

This example uses the %oracle:columns annotation to load only the time and name columns of the table. It also loads only visits by john.

The column names specified in %oracle:columns are positionally correlated to the put function parameters. Data values provided for the $c1 parameter are loaded into the TIME column, and data values provided for the $c2 parameter are loaded into the NAME column.

import module "oxh:text";

declare
   %oracle:put
   %oracle:columns('time', 'name')
   %oracle-property:targetTable('visits')
   %oracle-property:connection.user('db')
   %oracle-property:connection.password('password')
   %oracle-property:connection.url('jdbc:oracle:thin:@//localhost:1521/orcl.example.com')
function local:myPut($c1, $c2) external;
 
for $line in text:collection("mydata/*visits*.log")
let $split := fn:tokenize($line, "\s*,\s*")
where $split[2] eq 'john'
return
   local:myPut($split[1], $split[2])
 

If the VISITS table is empty before the query runs, then it contains the following data afterward:

TIME                           NAME            PAGE                 CODE
------------------------------ --------------- --------------- ----------
30-OCT-13 10.00.01.000000 AM   john
30-OCT-13 10.05.20.000000 AM   john
28-OCT-13 06.00.00.000000 AM   john

6.3.3 Oracle Loader for Hadoop Configuration Properties and Corresponding %oracle-property Annotations

When you use the Oracle Database adapter of Oracle XQuery for Hadoop, you indirectly use Oracle Loader for Hadoop. Oracle Loader for Hadoop defines configuration properties that control various aspects of connecting to Oracle Database and writing data. Oracle XQuery for Hadoop supports many of these properties, which are listed in the last column of the table below.

You can specify these properties with the generic -conf and -D hadoop command-line options in Oracle XQuery for Hadoop. Properties specified using this method apply to all Oracle Database adapter put functions in your query. See "Running Queries" and especially "Generic Options" for more information about the hadoop command-line options.

Alternatively, you can specify these properties as Oracle Database adapter put function annotations with the %oracle-property prefix. These annotations are listed in the second column of the table below. Annotations apply only to the particular Oracle Database adapter put function that contains them in its declaration.

For example, you can set the target table to VISITS by adding the following lines to the configuration file, and identifying the configuration file with the -conf option:

<property>
   <name>oracle.hadoop.loader.targetTable</name>
   <value>visits</value>
</property>
 

You can also set the target table to VISITS with the -D option, using the same Oracle Loader for Hadoop property:

-D oracle.hadoop.loader.targetTable=visits
 

Both methods set the target table to VISITS for all Oracle Database adapter put functions in your query.

Alternatively, this annotation sets the target table to VISITS only for the particular put function that has the annotation in the declaration:

%oracle-property:targetTable('visits')
 

This flexibility is provided for convenience. For example, if a query has multiple Oracle Database adapter put functions, each writing to a different table in the same database, then the most convenient way to specify the necessary information is like this:

  • Use the oracle.hadoop.loader.connection.url property in the configuration file to specify the database connection URL. Then identify the configuration file using the -conf option. This option sets the same database connection URL for all Oracle Database adapter put functions in your query.

  • Set a different table name using the %oracle-property:targetTable annotation in each Oracle Database adapter put function declaration.

The following table identifies the Oracle Loader for Hadoop properties and their equivalent Oracle XQuery for Hadoop annotations by functional category. Oracle XQuery for Hadoop supports only the Oracle Loader for Hadoop properties listed in this table.

Table 6-4 Configuration Properties and Corresponding %oracle-property Annotations

Property                                                 Annotation

Connection

oracle.hadoop.loader.connection.defaultExecuteBatch      %oracle-property:connection.defaultExecuteBatch
oracle.hadoop.loader.connection.oci_url                  %oracle-property:connection.oci_url
oracle.hadoop.loader.connection.password                 %oracle-property:connection.password
oracle.hadoop.loader.connection.sessionTimeZone          %oracle-property:connection.sessionTimeZone
oracle.hadoop.loader.connection.tns_admin                %oracle-property:connection.tns_admin
oracle.hadoop.loader.connection.tnsEntryName             %oracle-property:connection.tnsEntryName
oracle.hadoop.loader.connection.url                      %oracle-property:connection.url
oracle.hadoop.loader.connection.user                     %oracle-property:connection.user
oracle.hadoop.loader.connection.wallet_location          %oracle-property:connection.wallet_location

General

oracle.hadoop.loader.badRecordFlushInterval              %oracle-property:badRecordFlushInterval
oracle.hadoop.loader.compressionFactors                  %oracle-property:compressionFactors
oracle.hadoop.loader.enableSorting                       %oracle-property:enableSorting
oracle.hadoop.loader.extTabDirectoryName                 %oracle-property:extTabDirectoryName
oracle.hadoop.loader.loadByPartition                     %oracle-property:loadByPartition
oracle.hadoop.loader.logBadRecords                       %oracle-property:logBadRecords
oracle.hadoop.loader.rejectLimit                         %oracle-property:rejectLimit
oracle.hadoop.loader.sortKey                             %oracle-property:sortKey
oracle.hadoop.loader.tableMetadataFile                   %oracle-property:tableMetadataFile
oracle.hadoop.loader.targetTable                         %oracle-property:targetTable

Output

oracle.hadoop.loader.output.dirpathBufsize               %oracle-property:dirpathBufsize
oracle.hadoop.loader.output.escapeEnclosers              %oracle-property:output.escapeEnclosers
oracle.hadoop.loader.output.fieldTerminator              %oracle-property:output.fieldTerminator
oracle.hadoop.loader.output.granuleSize                  %oracle-property:output.granuleSize
oracle.hadoop.loader.output.initialFieldEncloser         %oracle-property:output.initialFieldEncloser
oracle.hadoop.loader.output.trailingFieldEncloser        %oracle-property:output.trailingFieldEncloser

Sampler

oracle.hadoop.loader.sampler.enableSampling              %oracle-property:sampler.enableSampling
oracle.hadoop.loader.sampler.hintMaxSplitSize            %oracle-property:sampler.hintMaxSplitSize
oracle.hadoop.loader.sampler.hintNumMapTasks             %oracle-property:sampler.hintNumMapTasks
oracle.hadoop.loader.sampler.loadCI                      %oracle-property:sampler.loadCI
oracle.hadoop.loader.sampler.maxHeapBytes                %oracle-property:sampler.maxHeapBytes
oracle.hadoop.loader.sampler.maxLoadFactor               %oracle-property:sampler.maxLoadFactor
oracle.hadoop.loader.sampler.maxSamplesPct               %oracle-property:sampler.maxSamplesPct
oracle.hadoop.loader.sampler.minSplits                   %oracle-property:sampler.minSplits
oracle.hadoop.loader.sampler.numThreads                  %oracle-property:sampler.numThreads

6.4 Oracle NoSQL Database Adapter

This adapter provides functions to read and write values stored in Oracle NoSQL Database.

This adapter is described in the following topics:

6.4.1 Prerequisites for Using the Oracle NoSQL Database Adapter

Before you write queries that use the Oracle NoSQL Database adapter, you must configure Oracle XQuery for Hadoop to use your Oracle NoSQL Database server.

You must set the following:

  • The KVHOME environment variable to the local directory containing the Oracle NoSQL database lib directory.

  • The oracle.kv.hosts and oracle.kv.kvstore configuration properties.

  • The OXH_SOLR_MR_HOME environment variable to the local directory containing search-mr-<version>.jar and search-mr-<version>-job.jar, only when the Tika parser is invoked; that is, only when the kv:collection-tika() or kv:get-tika() functions are invoked, or the %kv:collection('tika') or %kv:get('tika') annotations are used with external functions.

You can set the configuration properties using either the -D or -conf options in the hadoop command when you run the query. See "Running Queries."

This example sets KVHOME and uses the hadoop -D option in a query to set oracle.kv.kvstore:

$ export KVHOME=/local/path/to/kvstore/
$ hadoop jar $OXH_HOME/lib/oxh.jar -D oracle.kv.hosts=example.com:5000 -D oracle.kv.kvstore=kvstore ./myquery.xq -output ./myoutput

This example sets OXH_SOLR_MR_HOME environment variable when the Tika parser is invoked:

$ export OXH_SOLR_MR_HOME=/usr/lib/solr/contrib/mr

Note:

The HADOOP_CLASSPATH environment variable or -libjars command line option must not contain NoSQL DB jars.

See "Oracle NoSQL Database Adapter Configuration Properties."

6.4.2 Built-in Functions for Reading from and Writing to Oracle NoSQL Database

To use the built-in functions in your query, you must import the Oracle NoSQL Database module as follows:

import module "oxh:kv";

The Oracle NoSQL Database module contains the following functions:

6.4.2.1 kv:collection-text

Accesses a collection of values in the database. Each value is decoded as UTF-8 and returned as a string.

Signature

declare %kv:collection("text") function 
   kv:collection-text($parent-key as xs:string?, $depth as xs:int?, $subrange as xs:string?) as xs:string* external;

declare %kv:collection("text") function 
   kv:collection-text($parent-key as xs:string?, $depth as xs:int?) as xs:string* external;

declare %kv:collection("text") function 
   kv:collection-text($parent-key as xs:string?) as xs:string* external;

Parameters

See "Parameters." Omitting $subrange is the same as specifying $subrange(). Likewise, omitting $depth is the same as specifying $depth().

Returns

One string for each value

6.4.2.2 kv:collection-avroxml

Accesses a collection of values in the database. Each value is read as an Avro record and returned as an XML element. The records are converted to XML as described in "Reading Records."

Signature

declare %kv:collection("avroxml") function 
   kv:collection-avroxml($parent-key as xs:string?, $depth as xs:int?, $subrange as xs:string?) as element()* external;

declare %kv:collection("avroxml") function 
   kv:collection-avroxml($parent-key as xs:string?, $depth as xs:int?) as element()* external;

declare %kv:collection("avroxml") function 
   kv:collection-avroxml($parent-key as xs:string?) as element()* external;

Parameters

See "Parameters." Omitting $subrange is the same as specifying $subrange(). Likewise, omitting $depth is the same as specifying $depth().

Returns

One XML element for each Avro record

6.4.2.3 kv:collection-xml

Accesses a collection of values in the database. Each value is read as a sequence of bytes and parsed as XML.

Signature

declare %kv:collection("xml") function 
   kv:collection-xml($parent-key as xs:string?, $depth as xs:int?, $subrange as xs:string?) as document-node()* external;

declare %kv:collection("xml") function 
   kv:collection-xml($parent-key as xs:string?, $depth as xs:int?) as document-node()* external;

declare %kv:collection("xml") function 
   kv:collection-xml($parent-key as xs:string?) as document-node()* external;

Parameters

See "Parameters." Omitting $subrange is the same as specifying $subrange(). Likewise, omitting $depth is the same as specifying $depth().

Returns

One XML document for each value.

6.4.2.4 kv:collection-binxml

Accesses a collection of values in the database. Each value is read as XDK binary XML and returned as an XML document.

Signature

declare %kv:collection("binxml") function 
   kv:collection-binxml($parent-key as xs:string?, $depth as xs:int?, $subrange as xs:string?) as document-node()* external;
declare %kv:collection("binxml") function 
   kv:collection-binxml($parent-key as xs:string?, $depth as xs:int?) as document-node()* external;
declare %kv:collection("binxml") function
   kv:collection-binxml($parent-key as xs:string?) as document-node()* external;

Parameters

See "Parameters." Omitting $subrange is the same as specifying $subrange(). Likewise, omitting $depth is the same as specifying $depth().

Returns

One XML document for each value.

6.4.2.5 kv:collection-tika

Accesses a collection of values in the database. Each value is parsed by Tika and returned as a document node.

Signature

declare %kv:collection("tika") function
kv:collection-tika($parent-key as xs:string?, $depth as xs:int?, $subrange as xs:string?) $contentType as xs:string?) as document-node()* external;

Parameters

See "Parameters." Omitting $subrange is the same as specifying $subrange(). Likewise, omitting $depth is the same as specifying $depth().

Returns

One document node for each value.

6.4.2.6 kv:put-text

Writes a key-value pair. The $value is encoded as UTF-8.

Signature

declare %kv:put("text") function
   kv:put-text($key as xs:string, $value as xs:string) external;

6.4.2.7 kv:put-xml

Writes a key/value pair. The $xml is serialized and encoded as UTF-8.

Signature

declare %kv:put("xml") function 
   kv:put-xml($key as xs:string, $xml as node()) external;

6.4.2.8 kv:put-binxml

Puts a key/value pair. The $xml is encoded as XDK binary XML. See Oracle XML Developer's Kit Programmer's Guide.

Signature

declare %kv:put("binxml") function
   kv:put-binxml($key as xs:string, $xml as node()) external;

6.4.2.9 kv:get-text

Obtains the value associated with the key. The value is decoded as UTF-8 and returned as a string.

Signature

declare %kv:get("text") function 
   kv:get-text($key as xs:string) as xs:string? external;

6.4.2.10 kv:get-avroxml

Obtains the value associated with the key. The value is read as an Avro record and returned as an XML element. The records are converted to XML as described in "Reading Records."

Signature

declare %kv:get("avroxml") function 
   kv:get-avroxml($key as xs:string) as element()? external;

6.4.2.11 kv:get-xml

Obtains the value associated with the key. The value is read as a sequence of bytes and parsed as XML.

Signature

declare %kv:get("xml")  function 
   kv:get-xml($key as xs:string) as document-node()? external;

6.4.2.12 kv:get-binxml

Obtains the value associated with the key. The value is read as XDK binary XML and returned as an XML document.

Signature

declare %kv:get("binxml") function
   kv:get-binxml($key as xs:string) as document-node()? external;

6.4.2.13 kv:get-tika

Obtains the value associated with the key. The value is read as a sequence of bytes, parsed by Tika, and returned as a document node.

Signature

declare %kv:get("tika") function
   kv:get-tika($key as xs:string, $contentType as xs:string?) as document-node()? external;

6.4.2.14 kv:key-range

Defines a prefix range. The prefix defines both the lower and upper inclusive boundaries.

Use this function as the subrange argument of a kv:collection function.

Signature

kv:key-range($prefix as xs:string) as xs:string;
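
Example

The following sketch builds on the /users/text keys written in Example 6-8 and reads only the values under the john key component:

import module "oxh:text";
import module "oxh:kv";

for $line in kv:collection-text("/users/text", (), kv:key-range("john"))
return text:put($line)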

6.4.2.15 kv:key-range

Specifies a key range.

Use this function as the subrange argument of a kv:collection function.

Signature

kv:key-range($start as xs:string, $start-inclusive as xs:boolean, $end as xs:string, $end-inclusive as xs:boolean) as xs:string;

Parameters

$start: Defines the lower boundary of the key range.

$start-inclusive: A value of true includes $start in the range, or false omits it.

$end: Defines the upper boundary of the key range. It must be greater than $start.

$end-inclusive: A value of true includes $end in the range, or false omits it.
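
Example

The following call produces the range string I/alpha/beta/E, in the subrange format described in "Custom Functions for Reading Values from Oracle NoSQL Database":

kv:key-range("alpha", true(), "beta", false())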

6.4.3 Built-in Functions for Reading from and Writing to Oracle NoSQL Database using Table API

To use the built-in functions in your query, you must declare the namespace and import the module as follows:

declare namespace kv-table = "oxh:kv-table";
import module "oxh:kv-table";

The Oracle NoSQL Database through Table API module contains the following functions:

6.4.3.1 kv-table:collection-jsontext

These functions iterate over all rows, or a subset of rows, stored in a single table in Oracle NoSQL Database. Each row is returned in the form of a JSON string.

Signature

declare %kv-table:collection("jsontext") function 
  kv-table:collection-jsontext($tableName as xs:string) as xs:string*

declare %kv-table:collection("jsontext") function 
  kv-table:collection-jsontext($tableName as xs:string, $primaryKeyJsonValue as xs:string?) as xs:string*

declare %kv-table:collection("jsontext") function
  kv-table:collection-jsontext($tableName as xs:string, $primaryKeyJsonValue as xs:string?, $fieldRangeJsonValue as xs:string?) as xs:string*

Parameters

$tableName as xs:string: The name of the table in Oracle NoSQL Database.

$primaryKeyJsonValue as xs:string?: A partial primary key specified as JSON text.

$fieldRangeJsonValue as xs:string?: A field range for a remaining field of the given primary key, specified as JSON text:

{ 
   "name": "fieldname", 
   "start": "startVal", 
   "startInclusive": true|false,  
   "end" : "endVal",
   "endInclusive": true|false   
}

Returns

The JSON value of each row.

Use the json:parse-as-xml function to parse a JSON string into an XML document.

6.4.3.2 kv-table:get-jsontext

This function reads a single row stored in a table in Oracle NoSQL Database. The row is returned in the form of a JSON string. If the row is not found, then an empty sequence is returned.

Signature

declare %kv-table:get("jsontext") function
  kv-table:get-jsontext($tableName as xs:string, $primaryKeyJsonValue as xs:string) as xs:string?

Parameters

$tableName as xs:string: The name of the table in Oracle NoSQL Database.

$primaryKeyJsonValue as xs:string?: A full primary key specified as JSON text.

Returns

The JSON value of the row, or an empty sequence if the row is not found.

Use the json:parse-as-xml function to parse a JSON string into an XML document.

6.4.3.3 kv-table:put-jsontext

This function writes a row into Oracle NoSQL Database using its Table API.

Signature

declare %kv-table:put("jsontext") function
  kv-table:put-jsontext($tableName as xs:string, $jsonValue as xs:string);

Parameters

$tableName as xs:string: The name of the table in Oracle NoSQL Database.

$jsonValue as xs:string: The row specified as JSON text.

6.4.4 Built-in Functions for Reading from and Writing to Oracle NoSQL Database using Large Object API

To use the built-in functions in your query, you must declare the namespace and import the module as follows:

declare namespace kv-lob = "oxh:kv-lob";
import module "oxh:kv-lob";

The Oracle NoSQL Database through Large Object API module contains the following functions:

6.4.4.1 kv-lob:get-text

Obtains the value associated with the key. The value is decoded as UTF-8 and returned as a string.

Signature

declare %kv-lob:get("text") 
function kv-lob:get-text($key as xs:string) as xs:string?

6.4.4.2 kv-lob:get-xml

Obtains the value associated with the key. The value is read as a sequence of bytes and parsed as XML.

Signature

declare %kv-lob:get("xml") 
function kv-lob:get-xml($key as xs:string) as document-node()?

6.4.4.3 kv-lob:get-binxml

Obtains the value associated with the key. The value is read as XDK binary XML and returned as an XML document. See Oracle XML Developer's Kit Programmer's Guide.

Signature

declare %kv-lob:get("binxml")
function kv-lob:get-binxml($key as xs:string) as document-node()?

6.4.4.4 kv-lob:get-tika

Obtains the value associated with the key. The value is read as a sequence of bytes, parsed by Tika, and returned as a document node.

Signature

declare %kv-lob:get("tika")
function kv-lob:get-tika($key as xs:string) as document-node()?
declare %kv-lob:get("tika")
function kv-lob:get-tika($key as xs:string, $contentType as xs:string?) as document-node()?

6.4.4.5 kv-lob:put-text

Writes a key-value pair. The $value is encoded as UTF-8.

Signature

declare %kv-lob:put("text")
function kv-lob:put-text($key as xs:string, $value as xs:string)

6.4.4.6 kv-lob:put-xml

Writes a key/value pair. The $document is serialized and encoded as UTF-8.

Signature

declare %kv-lob:put("xml")
function kv-lob:put-xml($key as xs:string, $document as node())

6.4.4.7 kv-lob:put-binxml

Puts a key/value pair. The $document is encoded as XDK binary XML.

Signature

declare %kv-lob:put("binxml")
function kv-lob:put-binxml($key as xs:string, $document as node())

6.4.5 Custom Functions for Reading Values from Oracle NoSQL Database

You can use the following annotations to define functions that read values from Oracle NoSQL Database. These annotations provide additional functionality that is not available using the built-in functions.

Signature

Custom functions for reading collections of NoSQL values must have one of the following signatures:

declare %kv:collection("text") [additional annotations] 
   function local:myFunctionName($parent-key as xs:string?, $depth as xs:int?, $subrange as xs:string?) as xs:string* external;

declare %kv:collection(["xml"|"binxml"|"tika"]) [additional annotations] 
   function local:myFunctionName($parent-key as xs:string?, $depth as xs:int?, $subrange as xs:string?) as document-node()* external;

declare %kv:collection("tika") [additional annotations]
   function local:myFunctionName($parent-key as xs:string?, $depth as xs:int?, $subrange as xs:string?, $contentType as xs:string?) as document-node()* external;

Annotations

%kv:collection("method")

Declares the NoSQL Database collection function. Required.

The method parameter is one of the following values:

  • avroxml: Each value is read as an Avro record and returned as an XML element. The records are converted to XML as described in "Reading Records."

  • binxml: Each value is read as XDK binary XML and returned as an XML document.

  • text: Each value is decoded using the character set specified by the %output:encoding annotation.

  • tika: Each value is parsed by Tika, and returned as a document node.

  • xml: Each value is parsed as XML, and returned as an XML document.

%kv:key("true" | "false")

Controls whether the key of a key-value pair is set as the document-uri of the returned value. Specify true to return the key.

The default setting is true when method is xml, avroxml, or binxml, and false when it is text. Text functions with this annotation set to true must be declared to return text()* instead of xs:string*. Atomic xs:string values are not associated with a document node, but text nodes are. For example:

declare %kv:collection("text") %kv:key("true")
   function local:col($parent-key as xs:string?) as text()* external;

When the key is returned, you can obtain its string representation by using the kv:key() function. For example:

for $value in local:col(...)
let $key := $value/kv:key()
return ...

%avro:schema-kv("schema-name")

Specifies the Avro reader schema. This annotation is valid only when method is avroxml. Optional.

The schema-name is a fully qualified record name. The record schema is retrieved from the Oracle NoSQL Database catalog. The record value is mapped to the reader schema. For example, %avro:schema-kv("org.example.PersonRecord").

See Also:

For information about Avro schemas, the Oracle NoSQL Database Getting Started Guide at

http://docs.oracle.com/cd/NOSQL/html/GettingStartedGuide/schemaevolution.html
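
For example, the following sketch maps the stored records to the com.example.User schema from Example 6-9 and reads the values written there under /users/avro:

import module "oxh:text";

declare %kv:collection("avroxml") %avro:schema-kv("com.example.User")
   function local:users($parent-key as xs:string?) as element()* external;

for $user in local:users("/users/avro")
return text:put($user/full_name)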

%output:encoding

Specifies the character encoding of text values. UTF-8 is assumed when this annotation is not used. The valid encodings are those supported by the JVM.

This annotation currently only applies to the text method. For XML files, the document's encoding declaration is used if it is available.

See Also:

"Supported Encodings" in the Oracle Java SE documentation at

http://docs.oracle.com/javase/7/docs/technotes/guides/intl/encoding.doc.html
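
For example, the following declaration (the charset is illustrative) reads text values encoded in ISO-8859-1 instead of UTF-8:

declare %kv:collection("text") %output:encoding("ISO-8859-1")
   function local:col($parent-key as xs:string?) as xs:string* external;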

Parameters

Parameter 1: $parent-key as xs:string?

Specifies the parent key whose child KV pairs are returned by the function. The major key path must be a partial path and the minor key path must be empty. An empty sequence results in fetching all keys in the store.

See Also:

For the format of the key, Oracle NoSQL Database Java Reference at

http://docs.oracle.com/cd/NOSQL/html/javadoc/oracle/kv/Key.html#toString

Parameter 2: $depth as xs:int?

Specifies whether parents, children, descendants, or a combination are returned. The following values are valid:

  • kv:depth-parent-and-descendants(): Selects the parents and all descendants.

  • kv:depth-children-only(): Selects only the immediate children, but not the parent.

  • kv:depth-descendants-only(): Selects all descendants, but not the parent.

  • kv:depth-parent-and-children(): Selects the parent and the immediate children.

An empty sequence implies kv:depth-parent-and-descendants().

This example selects all the descendants, but not the parent:

kv:collection-text("/parent/key", kv:depth-descendants-only(), ...
Parameter 3: $subRange as xs:string?

Specifies a subrange to further restrict the range under parentKey to the major path components. The format of the string is:

<startType>/<start>/<end>/<endType>

The startType and endType are either I for inclusive or E for exclusive.

The start and end are the starting and ending key strings.

If the range does not have a lower boundary, then omit the leading startType/start specification from the string representation. Similarly, if the range does not have an upper boundary, then omit the trailing end/endType specification. A KeyRange requires at least one boundary, thus at least one specification must appear in the string representation.

The kv:key-range function provides a convenient way to create a range string.

The value can also be the empty sequence.

The following examples are valid subrange specifications:

Example           Description

I/alpha/beta/E    From alpha inclusive to beta exclusive
E//0123/I         From "" exclusive to 0123 inclusive
I/chi/            From chi inclusive to infinity
E//               From "" exclusive to infinity
/chi/E            From negative infinity to chi exclusive
//I               From negative infinity to "" inclusive

6.4.6 Custom Functions for Retrieving Single Values from Oracle NoSQL Database

The Oracle NoSQL Database adapter has get functions, which enable you to retrieve a single value from the database. Unlike collection functions, calls to get functions are not distributed across the cluster. When a get function is called, the value is retrieved by a single task.

Signature

Custom get functions must have one of the following signatures:

declare %kv:get("text") [additional annotations]
   function local:myFunctionName($key as xs:string) as xs:string? external;

declare %kv:get("avroxml") [additional annotations] 
   function local:myFunctionName($key as xs:string) as element()? external;

declare %kv:get(["xml"|"binxml"|"tika"]) [additional annotations]
   function local:myFunctionName($key as xs:string) as document-node()?

declare %kv:get(["tika"]) [additional annotations]
   function local:myFunctionName($key as xs:string $contentType as xs:string?) as document-node()?

Annotations

%kv:get("method")

Declares the NoSQL Database get function. Required.

The method parameter is one of the following values:

  • avroxml: The value is read as an Avro record and returned as an XML element. The records are converted to XML as described in "Reading Records."

  • binxml: The value is read as XDK binary XML and returned as an XML document.

  • text: The value is decoded using the character set specified by the %output:encoding annotation.

  • tika: Each value is parsed by Tika, and returned as a document node.

  • xml: The value is parsed as XML and returned as an XML document.

%kv:key("true" | "false")

Controls whether the key of a key-value pair is set as the document-uri of the returned value. Specify true to return the key.

The default setting is true when method is xml, avroxml, or binxml, and false when it is text. Text functions with this annotation set to true must be declared to return text()? instead of xs:string?. Atomic xs:string values are not associated with a document node, but text nodes are.

When the key is returned, you can obtain its string representation by using the kv:key() function.

%avro:schema-kv("schema-name")

Specifies the Avro reader schema. This annotation is valid only when method is avroxml. Optional.

The schema-name is a fully qualified record name. The record schema is retrieved from the Oracle NoSQL Database catalog. The record value is mapped to the reader schema. For example, %avro:schema-kv("org.example.PersonRecord").

See Also:

For information about Avro schemas, the Oracle NoSQL Database Getting Started Guide at

http://docs.oracle.com/cd/NOSQL/html/GettingStartedGuide/schemaevolution.html

%output:encoding

Specifies the character encoding of text values. UTF-8 is assumed when this annotation is not used. The valid encodings are those supported by the JVM.

This annotation currently only applies to the text method. For XML files, the document encoding declaration is used, if it is available.

See Also:

"Supported Encodings" in the Oracle Java SE documentation at

http://docs.oracle.com/javase/7/docs/technotes/guides/intl/encoding.doc.html
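
For example, the following sketch declares a custom XML get function and uses it to retrieve one of the comment values written in Example 6-10; the key is taken from that example:

import module "oxh:text";

declare %kv:get("xml")
   function local:get-comment($key as xs:string) as document-node()? external;

for $doc in local:get-comment("/comments/12345")
return text:put-xml($doc/comment)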

6.4.7 Custom Functions for Reading Values from Oracle NoSQL Database using Table API

You can use the following annotations to define functions that read values from Oracle NoSQL Database using Table API. These annotations provide additional functionality that is not available using the built-in functions.

Signature

Custom functions for reading collections of NoSQL values using Table API must have one of the following signatures:

declare %kv-table:collection("jsontext")
function local:myFunctionName($tableName as xs:string) as xs:string* external;

declare %kv-table:collection("jsontext")
function local:myFunctionName($tableName as xs:string, $primaryKeyJsonValue as xs:string?) as xs:string* external;

declare %kv-table:collection("jsontext")
function local:myFunctionName($tableName as xs:string, $primaryKeyJsonValue as xs:string?, $fieldRangeJsonValue as xs:string?) as xs:string* external;

Annotations

%kv-table:collection("jsontext")

Declares the collection function that uses Table API.

Note:

jsontext is the only supported and required annotation value.

Parameters

Same as "Parameters."

Returns

Same as "Returns."

6.4.8 Custom Functions for Reading Single Row from Oracle NoSQL Database using Table API

You can use the following annotations to define functions that read a single row from Oracle NoSQL Database using Table API. These annotations provide additional functionality that is not available using the built-in functions.

Signature

Custom functions that read a single row from Oracle NoSQL Database using Table API must have the following signature:

declare %kv-table:get("jsontext")
function local:myFunctionName($tableName as xs:string, $primaryKeyJsonValue as xs:string?) as xs:string? external;

Annotations

%kv-table:get("jsontext")

Declares the get function that uses Table API.

Note:

jsontext is the only supported and required annotation value.

Parameters

Same as "Parameters."

Returns

Same as "Returns."

6.4.9 Custom Functions for Retrieving Single Values from Oracle NoSQL Database using Large Object API

You can use the following annotations to define functions that read values from Oracle NoSQL Database using Large Object API. These annotations provide additional functionality that is not available using the built-in functions.

Signature

Custom functions for reading single values using Large Object API must have one of the following signatures:

declare %kv-lob:get("text") [additional annotations]
function local:myFunctionName($key as xs:string) as xs:string? external;
declare %kv-lob:get(["xml"|"binxml"|"tika"]) [additional annotations]
function local:myFunctionName($key as xs:string) as document-node()?
declare %kv-lob:get(["tika"]) [additional annotations]
function local:myFunctionName($key as xs:string $contentType as xs:string?) as document-node()?

Annotations

%kv-lob:get("method")

Declares the NoSQL Database get function that uses Large Object API. Required. Supported method parameters are binxml, text, tika, and xml, the same as in %kv:get("method").

Note:

The avroxml method is not supported with Large Object API.

%kv-lob:key("true" | "false")

Controls whether the key of a key-value pair is set as the document-uri of the returned value. Specify true to return the key. Same as %kv:key().

%output:encoding

Specifies the character encoding of text values. UTF-8 is assumed when this annotation is not used. The valid encodings are those supported by the JVM. This annotation currently only applies to the text method. For XML files, the document encoding declaration is used, if it is available.

6.4.10 Custom Functions for Writing to Oracle NoSQL Database

You can use the following annotations to define functions that write to Oracle NoSQL Database.

Signature

Custom functions for writing to Oracle NoSQL Database must have one of the following signatures:

declare %kv:put("text") function
   local:myFunctionName($key as xs:string, $value as xs:string) external;

declare %kv:put(["xml"|"binxml"|"avroxml"]) function 
   local:myFunctionName($key as xs:string, $xml as node()) external;

Annotations


%kv:put("method")

Declares the NoSQL Database module put function. Required.

The method determines how the value is stored. It must be one of the following values:

  • text: $value is serialized and encoded using the character set specified by the %output:encoding annotation.

  • avroxml: $xml is mapped to an instance of the Avro record specified by the %avro:schema-kv annotation. See "Writing XML as Avro."

  • binxml: $xml is encoded as XDK binary XML.

  • xml: $xml is serialized and encoded using the character set specified by the %output:encoding annotation. You can specify other XML serialization parameters using %output:*.

%avro:schema-kv("schema-name")

Specifies the record schema of the values to be written. The annotation value is a fully qualified record name. The record schema is retrieved from the Oracle NoSQL Database catalog.

For example: %avro:schema-kv("org.example.PersonRecord")

%output:*

A standard XQuery serialization parameter for the output method (text or XML) specified in %kv:put. See "Serialization Annotations."
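
For example, the following declaration (the serialization parameters are illustrative) writes indented XML values without an XML declaration:

declare %kv:put("xml") %output:indent("yes") %output:omit-xml-declaration("yes")
   function local:put-indented($key as xs:string, $xml as node()) external;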

6.4.11 Custom Functions for Writing Values to Oracle NoSQL Database using Table API

You can use the following annotations to define functions that write to Oracle NoSQL Database using Table API.

Signature

Custom functions for writing rows using Table API must have the following signature:

declare %kv-table:put("jsontext")
function local:myFunctionName($tableName as xs:string, $jsonValue as xs:string?) external;

Annotations

%kv-table:put("jsontext")

Declares the put function that uses Table API.

Note:

jsontext is the only supported and required annotation value.

Parameters

Same as "Parameters."

6.4.12 Custom Functions for Writing Values to Oracle NoSQL Database using Large Object API

You can use the following annotations to define functions that write to Oracle NoSQL Database using Large Object API.

Signature

Custom functions for writing values using Large Object API must have one of the following signatures:

declare %kv-lob:put("text")
function  local:myFunctionName($key as xs:string, $value as xs:string) external;
declare %kv-lob:put(["xml"|"binxml"]) 
function local:myFunctionName($key as xs:string, $xml as node()) external;

Annotations

%kv-lob:put("method")

Declares the NoSQL Database put function that uses Large Object API. Required. Supported method parameters are binxml, text, and xml, the same as in %kv:put("method").

Note:

The avroxml method is not supported with Large Object API.

%output:*

A standard XQuery serialization parameter for the output method (text or XML) specified in %kv-lob:put. See "Serialization Annotations."

6.4.13 Examples of Oracle NoSQL Database Adapter Functions

Example 6-8 Writing and Reading Text in Oracle NoSQL Database

This example uses the following text file in HDFS. The file contains user profile information such as user ID, full name, and age, separated by colons (:).

mydata/users.txt  
 
john:John Doe:45
kelly:Kelly Johnson:32
laura:Laura Smith:
phil:Phil Johnson:27

The first query stores the lines of this text file in Oracle NoSQL Database as text values.

import module "oxh:text";
import module "oxh:kv";
 
for $line in text:collection("mydata/users.txt")
let $split := fn:tokenize($line, ":")
let $key := "/users/text/" || $split[1]
return
   kv:put-text($key, $line)

The next query reads the values from the database:

import module "oxh:text";
import module "oxh:kv";

for $value in kv:collection-text("/users/text")
let $split := fn:tokenize($value, ":")
where $split[2] eq "Phil Johnson"
return
   text:put($value)

The query creates a text file that contains the following line:

phil:Phil Johnson:27

Example 6-9 Writing and Reading Avro in Oracle NoSQL Database

In this example, the following Avro schema is registered with Oracle NoSQL Database:

{
   "type": "record",
   "name": "User",
   "namespace": "com.example",
   "fields" : [
      {"name": "id", "type": "string"},
      {"name": "full_name", "type": "string"},
      {"name": "age", "type": ["int", "null"] }
   ]
 }

The next query writes the user names to the database as Avro records.

import module "oxh:text";

declare %kv:put("avroxml") %avro:schema-kv("com.example.User")
   function local:put-user($key as xs:string, $value as node()) external;

for $line in text:collection("mydata/users.txt")
let $split := fn:tokenize($line, ":")
let $id := $split[1]
let $key := "/users/avro/" || $id
return
   local:put-user(
      $key,
      <user>
         <id>{$id}</id>
         <full_name>{$split[2]}</full_name>
         {
            if ($split[3] castable as xs:int) then
               <age>{$split[3]}</age>
            else
               ()
         }
      </user>
   )

This query reads the values from the database:

import module "oxh:text";
import module "oxh:kv";

for $user in kv:collection-avroxml("/users/avro")
where $user/age gt 30
return
   text:put($user/full_name)

The query creates text files with the following lines:

John Doe
Kelly Johnson

Example 6-10 Storing XML in NoSQL Database

The following query uses the XML files shown in Example 6-24 of "Examples of XML File Adapter Functions" as input. It writes each comment element as an Oracle NoSQL Database value:

import module "oxh:xmlf";
import module "oxh:kv";
 
for $comment in xmlf:collection("mydata/comments*.xml")/comments/comment
let $key := "/comments/" || $comment/@id
return 
   kv:put-xml($key, $comment)
 

The query writes the five comment elements as XML values in Oracle NoSQL Database.

For very large XML files, modify the query as follows to improve performance and reduce disk space consumption:

  • Use the following for clause, which causes each XML file to be split and processed in parallel by multiple tasks:

    for $comment in xmlf:collection("mydata/comments*.xml", "comment")
    
  • In the return clause, use kv:put-binxml instead of kv:put-xml to store the values as binary XML instead of plain text.

Use the kv:collection-xml function to read the values in the database. For example:

import module "oxh:text";
import module "oxh:kv";

for $comment in kv:collection-xml("/comments")/comment
return 
   text:put($comment/@id || " " || $comment/@user)
 

The query creates text files that contain the following lines:

12345 john
23456 john
54321 mike
56789 kelly
87654 mike

Example 6-11 Storing XML as Avro in Oracle NoSQL Database

This example converts the XML values to Avro before they are stored.

Add the following Avro schema to Oracle NoSQL Database:

{
   "type": "record",
   "name": "Comment",
   "namespace": "com.example",
   "fields" : [
      {"name": "cid", "type": "string"},
      {"name": "user", "type": "string"},
      {"name": "content", "type": "string"},
      {"name": "likes", "type" : { "type" : "array", "items" : "string" } }
   ]
}

The following query writes five comment elements as Avro values in Oracle NoSQL Database:

import module "oxh:xmlf";
import module "oxh:kv";

declare %kv:put("avroxml") %avro:schema-kv("com.example.Comment")
  function local:put-comment($key as xs:string, $value as node()) external;

for $comment in xmlf:collection("mydata/comments*.xml", "comment")
let $key := "/comments/" || $comment/@id
let $value :=
  <comment>
    <cid>{$comment/@id/data()}</cid>
    <user>{$comment/@user/data()}</user>
    <content>{$comment/@text/data()}</content>
    <likes>{
      for $like in $comment/like
      return <oxh:item>{$like/@user/data()}</oxh:item>
    }</likes>
  </comment>
return 
  local:put-comment($key, $value)

Use the kv:collection-avroxml function to read the values in the database. For example:

import module "oxh:text";
import module "oxh:kv";

for $comment in kv:collection-avroxml("/comments")
return 
   text:put($comment/cid || " " || $comment/user || " " || count($comment/likes/*))

The query creates text files that contain the following lines:

12345 john 0
23456 john 2
54321 mike 1
56789 kelly 2
87654 mike 0

Example 6-12 Reading and writing data using Oracle NoSQL Database Table API

This example uses the following text file in HDFS. The file contains user profile information such as user ID, full name, and age, separated by colons (:).

mydata/users.txt  
john:John Doe:45
kelly:Kelly Johnson:32
laura:Laura Smith:
phil:Phil Johnson:27

Create a table named users in Oracle NoSQL Database as follows:

CREATE TABLE users (id STRING, name STRING, age INTEGER, PRIMARY KEY (id));

The first query stores the users' IDs, names, and ages in this table, skipping users that have no age:

import module "oxh:text";
import module "oxh:kv-table";

for $line in text:collection("mydata/users.txt")
let $split := tokenize($line, ":")
let $id := $split[1]
let $name := $split[2]
let $age := $split[3]
where string-length($age) gt 0
let $row := 
'{' ||
  '"id":"' || $id || '",' ||
  '"name":"' || $name || '",' ||
  '"age":' || $age ||
'}'

return
   kv-table:put-jsontext(“users", $row)

After running this query, the table contains the following records:

id      name            age

john    John Doe        45
kelly   Kelly Johnson   32
phil    Phil Johnson    27

The second query reads rows from the table and returns the IDs of users whose names end with Johnson.

import module "oxh:text ";
import module "oxh:json";
import module "oxh:kv-table";
 
for $row in kv-table:collection("users")
let $user := json:parse-as-xml($row)
let $id := $user/json:get(“id")
let $name := $user/json:get(“name")
where ends-with($name, “Johnson")
 
return text:put($id)

The query creates a text file that contains the following lines:

kelly
phil

Example 6-13 Reading data using Oracle NoSQL Database Large Object API

Assuming Oracle NoSQL Database contains the following information:

  1. Table userImages

    CREATE TABLE userImages (imageFileName STRING, imageVersion STRING, imageDescription STRING, PRIMARY KEY (imageFileName))

    imageFileName   imageVersion   imageDescription

    IMG_001.JPG     1              Sunrise
    IMG_002.JPG     1              Sunrise

  2. Key/Value data loaded with Large Object API where:

    • Key is lob/imageFileName/image.lob

    • Value is JPEG image data that contains geolocation metadata in EXIF format

    The following query extracts that metadata and converts it to CSV format as imageFileName, latitude, and longitude.

    import module “oxh:kv-table";
    import module “oxh:kv-lob";
    import module "oxh:tika";
    import module "oxh:json";
    import module "oxh:text ";
     
    for $row in kv-table:collection("userImages")
     
    let $imageFileName := json:parse-as-xml($row)/json:get(“imageFileName")
    let $imageKey := “lob/" || $imageFileName || “/image.lob"
    let $doc := kv-lob:get-tika($imageKey, “image/jpeg")
    let $lat := $doc/tika:metadata/tika:property[@name eq "GPS Latitude"]
    let $lon := $doc/tika:metadata/tika:property[@name eq "GPS Longitude"]
    where exists($lat) and exists($lon)
     
    return text:put($imageFileName || "," || $lat || "," || $lon)
    

6.4.14 Oracle NoSQL Database Adapter Configuration Properties

Oracle XQuery for Hadoop uses the generic options for specifying configuration properties in the Hadoop command. You can use the -conf option to identify configuration files, and the -D option to specify individual properties. See "Running Queries."

You can set various configuration properties for the Oracle NoSQL Database adapter that control the durability characteristics and timeout periods. You must set oracle.kv.hosts and oracle.kv.kvstore. The following properties configure the Oracle NoSQL Database adapter.
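
For example, you can collect the connection settings and a durability policy in a configuration file and pass it to the hadoop command with the -conf option; the host and store names here are illustrative:

<?xml version="1.0"?>
<configuration>
   <property>
      <name>oracle.kv.hosts</name>
      <value>example.com:5000</value>
   </property>
   <property>
      <name>oracle.kv.kvstore</name>
      <value>kvstore</value>
   </property>
   <property>
      <name>oracle.hadoop.xquery.kv.config.durability</name>
      <value>NO_SYNC, NO_SYNC, SIMPLE_MAJORITY</value>
   </property>
</configuration>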


oracle.hadoop.xquery.kv.config.durability

Type: String

Default Value: NO_SYNC, NO_SYNC, SIMPLE_MAJORITY

Description: Defines the durability characteristics associated with %kv:put operations. The value consists of three parts, which you specify in order and separate with commas (,):

MasterPolicy, ReplicaPolicy, ReplicaAck

  • MasterPolicy: The synchronization policy used when committing a transaction to the master database. Set this part to one of the following constants:

    NO_SYNC: Do not write or synchronously flush the log on a transaction commit.

    SYNC: Write and synchronously flush the log on a transaction commit.

    WRITE_NO_SYNC: Write but do not synchronously flush the log on a transaction commit.

  • ReplicaPolicy: The synchronization policy used when committing a transaction to the replica databases. Set this part to NO_SYNC, SYNC, or WRITE_NO_SYNC, as described under MasterPolicy.

  • ReplicaAck: The acknowledgment policy used to obtain transaction acknowledgments from the replica databases. Set this part to one of the following constants:

    ALL: All replicas must acknowledge that they have committed the transaction.

    NONE: No transaction commit acknowledgments are required, and the master does not wait for them.

    SIMPLE_MAJORITY: A simple majority of replicas (such as 3 of 5) must acknowledge that they have committed the transaction.

oracle.hadoop.xquery.kv.config.requestLimit

Type: Comma-separated list of integers

Default Value: 100, 90, 80

Description: Limits the number of simultaneous requests to prevent nodes with long service times from consuming all threads in the KV store client. The value consists of three integers, which you specify in order and separate with commas:

maxActiveRequests, requestThresholdPercent, nodeLimitPercent

  • maxActiveRequests: The maximum number of active requests permitted by the KV client. This number is typically derived from the maximum number of threads that the client has set aside for processing requests.

  • requestThresholdPercent: The percentage of maxActiveRequests at which requests are limited.

  • nodeLimitPercent: The maximum number of active requests that can be associated with a node when the number of active requests exceeds the threshold specified by requestThresholdPercent.

oracle.hadoop.xquery.kv.config.requestTimeout

Type: Long

Default Value: 5000 ms

Description: Configures the request timeout period in milliseconds. The value must be greater than zero (0).

oracle.hadoop.xquery.kv.config.socketOpenTimeout

Type: Long

Default Value: 5000 ms

Description: Configures the open timeout used when establishing sockets for client requests, in milliseconds. Shorter timeouts result in more rapid failure detection and recovery. The default open timeout is adequate for most applications. The value must be greater than zero (0).

oracle.hadoop.xquery.kv.config.socketReadTimeout

Type: Long

Default Value: 30000 ms

Description: Configures the read timeout period associated with the sockets that make client requests, in milliseconds. Shorter timeouts result in more rapid failure detection and recovery. Nonetheless, the timeout period should be sufficient to allow the longest timeout associated with a request.

oracle.kv.batchSize

Type: Integer

Default Value: Not defined

Description: The desired number of keys for the InputFormat to fetch during each network round trip. A value of zero (0) sets the property to a default value.

oracle.kv.consistency

Type: Consistency

Default Value: NONE_REQUIRED

Description: The consistency guarantee for reading child key-value pairs. The following keywords are valid values:

  • ABSOLUTE: Requires the master to service the transaction so that consistency is absolute.

  • NONE_REQUIRED: Allows replicas to service the transaction, regardless of the state of the replicas relative to the master.

oracle.kv.hosts

Type: String

Default Value: Not defined

Description: An array of one or more hostname:port pairs that identify the hosts in the KV store with the source data. Separate multiple pairs with commas.

oracle.kv.kvstore

Type: String

Default Value: Not defined

Description: The name of the KV store with the source data.

oracle.kv.timeout

Type: Long

Default Value: Not defined

Description: Sets a maximum time interval in milliseconds for retrieving a selection of key-value pairs. A value of zero (0) sets the property to its default value.

oracle.hadoop.xquery.kv.config.LOBSuffix

Type: String

Default Value: .lob

Description: Configures the default suffix associated with LOB keys.

oracle.hadoop.xquery.kv.config.LOBTimeout

Type: Long

Default Value: Not defined

Description: Configures the default timeout associated with chunk access during operations on large objects (LOBs), in milliseconds.

oracle.hadoop.xquery.kv.config.readZones

Type: Comma-separated list of strings

Default Value: Not defined

Description: Sets the zones in which nodes must be located to be used for read operations.

oracle.hadoop.xquery.kv.config.security

Type: String

Default Value: Not defined

Description: Configures security properties for the client.

6.5 Sequence File Adapter

The sequence file adapter provides functions to read and write Hadoop sequence files. A sequence file is a Hadoop-specific file format composed of key-value pairs.

The functions are described in the following topics:

6.5.1 Built-in Functions for Reading and Writing Sequence Files

To use the built-in functions in your query, you must import the sequence file module as follows:

import module "oxh:seq";

The sequence file module contains the following functions:

For examples, see "Examples of Sequence File Adapter Functions."

6.5.1.1 seq:collection

Accesses a collection of sequence files in HDFS and returns the values as strings. The files may be split up and processed in parallel by multiple tasks.

Signature

declare %seq:collection("text") function 
   seq:collection($uris as xs:string*) as xs:string* external;

Parameters

$uris: The sequence file URIs. The values in the sequence files must be either org.apache.hadoop.io.Text or org.apache.hadoop.io.BytesWritable. For BytesWritable values, the bytes are converted to a string using a UTF-8 decoder.

Returns

One string for each value in each file.

6.5.1.2 seq:collection-xml

Accesses a collection of sequence files in HDFS, parses each value as XML, and returns it. Each file may be split up and processed in parallel by multiple tasks.

Signature

declare %seq:collection("xml") function 
   seq:collection-xml($uris as xs:string*) as document-node()* external;

Parameters

$uris: The sequence file URIs. The values in the sequence files must be either org.apache.hadoop.io.Text or org.apache.hadoop.io.BytesWritable. For BytesWritable values, the XML document encoding declaration is used, if it is available.

Returns

One XML document for each value in each file.

6.5.1.3 seq:collection-binxml

Accesses a collection of sequence files in HDFS, reads each value as binary XML, and returns it. Each file may be split up and processed in parallel by multiple tasks.

Signature

declare %seq:collection("binxml") function 
   seq:collection-binxml($uris as xs:string*) as document-node()* external;

Parameters

$uris: The sequence file URIs. The values in the sequence files must be org.apache.hadoop.io.BytesWritable. The bytes are decoded as binary XML.

Returns

One XML document for each value in each file.

Notes

You can use this function to read files that were created by seq:put-binxml in a previous query. See "seq:put-binxml."

6.5.1.4 seq:collection-tika

Uses Tika to parse sequence files in HDFS. The values in the sequence files must be either org.apache.hadoop.io.Text or org.apache.hadoop.io.BytesWritable. For each value, a document node produced by Tika is returned.

Signature

declare %seq:collection("tika") function 
   seq:collection-tika($uris as xs:string*) as document-node()* external;
declare %seq:collection("tika") function 
   seq:collection-tika($uris as xs:string*, $contentType as xs:string?) as document-node()* external;

Parameters

$uris: The sequence file URIs. The values in the sequence files must be either org.apache.hadoop.io.Text or org.apache.hadoop.io.BytesWritable. The Tika library automatically detects the character encoding. Alternatively, the encoding can be passed in the $contentType parameter as a charset attribute.

$contentType: Specifies the media type of the content to parse, and may have the charset attribute.

Returns

One document node for each value in each file. See "Tika Parser Output Format."

6.5.1.5 seq:put

Writes either the string value or both the key and string value of a key-value pair to a sequence file in the output directory of the query.

This function writes the keys and values as org.apache.hadoop.io.Text.

When the function is called without the $key parameter, it writes the values as org.apache.hadoop.io.Text and sets the key class to org.apache.hadoop.io.NullWritable, because there are no key values.

Signature

declare %seq:put("text") function
   seq:put($key as xs:string, $value as xs:string) external;

declare %seq:put("text") function 
   seq:put($value as xs:string) external;

Parameters

$key: The key of a key-value pair

$value: The value of a key-value pair

Returns

empty-sequence()

Notes

The values are spread across one or more sequence files. The number of files created depends on how the query is distributed among tasks. Each file has a name that starts with part, such as part-m-00000. You specify the output directory when the query executes. See "Running Queries."
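
For example, the following sketch reuses the mydata/users.txt file from the Oracle NoSQL Database adapter examples and writes each line as a value keyed by its user ID:

import module "oxh:text";
import module "oxh:seq";

for $line in text:collection("mydata/users.txt")
return seq:put(fn:tokenize($line, ":")[1], $line)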

6.5.1.6 seq:put-xml

Writes either an XML value or a key and XML value to a sequence file in the output directory of the query.

This function writes the keys and values as org.apache.hadoop.io.Text.

When the function is called without the $key parameter, it writes the values as org.apache.hadoop.io.Text and sets the key class to org.apache.hadoop.io.NullWritable, because there are no key values.

Signature

declare %seq:put("xml") function
   seq:put-xml($key as xs:string, $xml as node()) external;

declare %seq:put("xml") function 
   seq:put-xml($xml as node()) external;

Parameters

$key: The key of a key-value pair

$xml: The value of a key-value pair

Returns

empty-sequence()

Notes

The values are spread across one or more sequence files. The number of files created depends on how the query is distributed among tasks. Each file has a name that starts with "part," such as part-m-00000. You specify the output directory when the query executes. See "Running Queries."

6.5.1.7 seq:put-binxml

Encodes an XML value as binary XML and writes the resulting bytes to a sequence file in the output directory of the query. The values are spread across one or more sequence files.

This function writes the keys as org.apache.hadoop.io.Text and the values as org.apache.hadoop.io.BytesWritable.

When the function is called without the $key parameter, it writes the values as org.apache.hadoop.io.BytesWritable and sets the key class to org.apache.hadoop.io.NullWritable, because there are no key values.

Signature

declare %seq:put("binxml") function
   seq:put-binxml($key as xs:string, $xml as node()) external;

declare %seq:put("binxml") function 
   seq:put-binxml($xml as node()) external;

Parameters

$key: The key of a key-value pair

$xml: The value of a key-value pair

Returns

empty-sequence()

Notes

The number of files created depends on how the query is distributed among tasks. Each file has a name that starts with part, such as part-m-00000. You specify the output directory when the query executes. See "Running Queries."

You can use the seq:collection-binxml function to read the files created by this function. See "seq:collection-binxml."

6.5.2 Custom Functions for Reading Sequence Files

You can use the following annotations to define functions that read collections of sequence files. These annotations provide additional functionality that is not available using the built-in functions.

Signature

Custom functions for reading sequence files must have one of the following signatures:

declare %seq:collection("text") [additional annotations] 
   function local:myFunctionName($uris as xs:string*) as xs:string* external;

declare %seq:collection(["xml"|"binxml"|"tika"]) [additional annotations]
   function local:myFunctionName($uris as xs:string*) as document-node()* external;
declare %seq:collection(["tika"]) [additional annotations]
   function local:myFunctionName($uris as xs:string*, $contentType as xs:string?) as document-node()* external;

Annotations

%seq:collection(["method"])

Declares the sequence file collection function, which reads sequence files. Required.

The optional method parameter can be one of the following values:

  • text: The values in the sequence files must be either org.apache.hadoop.io.Text or org.apache.hadoop.io.BytesWritable. Bytes are decoded using the character set specified by the %output:encoding annotation. They are returned as xs:string. Default.

  • xml: The values in the sequence files must be either org.apache.hadoop.io.Text or org.apache.hadoop.io.BytesWritable. The values are parsed as XML and returned by the function.

  • binxml: The values in the sequence files must be org.apache.hadoop.io.BytesWritable. The values are read as XDK binary XML and returned by the function.

  • tika: The values in the sequence files must be either org.apache.hadoop.io.Text or org.apache.hadoop.io.BytesWritable. The values are parsed by Tika and returned by the function.

%output:encoding("charset")

Specifies the character encoding of the input values. The valid encodings are those supported by the JVM. UTF-8 is the default encoding.

See Also:

"Supported Encodings" in the Oracle Java SE documentation at

http://docs.oracle.com/javase/7/docs/technotes/guides/intl/encoding.doc.html

%seq:key("true" | "false")

Controls whether the key of a key-value pair is set as the document-uri of the returned value. Specify true to return the keys. The default setting is true when method is binxml or xml, and false when it is text.

Text functions with this annotation set to true must return text()* instead of xs:string* because atomic xs:string is not associated with a document.

When the keys are returned, you can obtain their string representations by using the seq:key function.

This example returns text instead of string values because %seq:key is set to true.

declare %seq:collection("text") %seq:key("true")
   function local:col($uris as xs:string*) as text()* external;

The next example uses the seq:key function to obtain the string representations of the keys:

for $value in local:col(...)
let $key := $value/seq:key()
return 
   .
   .
   .

%seq:split-max("split-size")

Specifies the maximum split size as either an integer or a string value. The split size controls how the input file is divided into tasks. Hadoop calculates the split size as max($split-min, min($split-max, $block-size)). Optional.

In a string value, you can append K, k, M, m, G, or g to the value to indicate kilobytes, megabytes, or gigabytes instead of bytes (the default unit). These qualifiers are not case sensitive. The following examples are equivalent:

%seq:split-max(1024)
%seq:split-max("1024")
%seq:split-max("1K")
%seq:split-min("split-size")

Specifies the minimum split size as either an integer or a string value. The split size controls how the input file is divided into tasks. Hadoop calculates the split size as max($split-min, min($split-max, $block-size)). Optional.

In a string value, you can append K, k, M, m, G, or g to the value to indicate kilobytes, megabytes, or gigabytes instead of bytes (the default unit). These qualifiers are not case sensitive. The following examples are equivalent:

%seq:split-min(1024)
%seq:split-min("1024")
%seq:split-min("1K")

6.5.3 Custom Functions for Writing Sequence Files

You can use the following annotations to define functions that write collections of sequence files in HDFS.

Signature

Custom functions for writing sequence files must have one of the following signatures. You can omit the $key argument when you are not writing a key value.

declare %seq:put("text") [additional annotations] 
   function local:myFunctionName($key as xs:string, $value as xs:string) external;

declare %seq:put(["xml"|"binxml"]) [additional annotations] 
   function local:myFunctionName($key as xs:string, $xml as node()) external;

Annotations

%seq:put("method")

Declares the sequence file put function, which writes key-value pairs to a sequence file. Required.

If you use the $key argument in the signature, then the key is written as org.apache.hadoop.io.Text. If you omit the $key argument, then the key class is set to org.apache.hadoop.io.NullWritable.

Set the method parameter to text, xml, or binxml. The method determines the type used to write the value:

  • text: String written as org.apache.hadoop.io.Text

  • xml: XML written as org.apache.hadoop.io.Text

  • binxml: XML encoded as XDK binary XML and written as org.apache.hadoop.io.BytesWritable

%seq:compress("codec", "compressionType")

Specifies the compression format used on the output. The default is no compression. Optional.

The codec parameter identifies a compression codec. The first registered compression codec that matches the value is used. The value matches a codec if it equals one of the following:

  1. The fully qualified class name of the codec

  2. The unqualified class name of the codec

  3. The prefix of the unqualified class name before Codec (case insensitive)

Set the compressionType parameter to one of these values:

  • block: Keys and values are collected in groups and compressed together. Block compression is generally more compact, because the compression algorithm can take advantage of similarities among different values.

  • record: Only the values in the sequence file are compressed.

All of these examples use the default codec and block compression:

%seq:compress("org.apache.hadoop.io.compress.DefaultCodec", "block")
%seq:compress("DefaultCodec", "block")
%seq:compress("default", "block")
%seq:file("name")

Specifies the output file name prefix. The default prefix is part.

%output:parameter

A standard XQuery serialization parameter for the output method (text or XML) specified in %seq:put. See "Serialization Annotations."

See Also:

SequenceFile at Apache’s Hadoop Wiki.

"The Influence of Serialization Parameters" sections for XML and text output methods in XSLT and XQuery Serialization 3.1

6.5.4 Examples of Sequence File Adapter Functions

These examples query three XML files in HDFS with the following contents. Each XML file contains comments made by users on a specific day. Each comment can have zero or more "likes" from other users.

mydata/comments1.xml
 
<comments date="2013-12-30">
   <comment id="12345" user="john" text="It is raining :( "/>
   <comment id="56789" user="kelly" text="I won the lottery!">
      <like user="john"/>
      <like user="mike"/>
   </comment>
</comments>
 
mydata/comments2.xml
 
<comments date="2013-12-31">
   <comment id="54321" user="mike" text="Happy New Year!">
      <like user="laura"/>
   </comment>
</comments>
 
mydata/comments3.xml
  
<comments date="2014-01-01">
   <comment id="87654" user="mike" text="I don't feel so good."/>
   <comment id="23456" user="john" text="What a beautiful day!">
      <like user="kelly"/>
      <like user="phil"/>
   </comment>
</comments>

Example 6-14

The following query stores the comment elements in sequence files.

import module "oxh:seq";
import module "oxh:xmlf";
 
for $comment in xmlf:collection("mydata/comments*.xml", "comment")
return 
   seq:put-xml($comment)

Example 6-15

The next query reads the sequence files generated by the previous query, which are stored in an output directory named myoutput. The query then writes the names of users who made multiple comments to a text file.

import module "oxh:seq";
import module "oxh:text";

for $comment in seq:collection-xml("myoutput/part*")/comment
let $user := $comment/@user
group by $user
let $count := count($comment)
where $count gt 1
return
   text:put($user || " " || $count)

The text file created by the previous query contains the following lines:

john 2
mike 2

See "XML File Adapter."

Example 6-16

The following query extracts comment elements from XML files and stores them in compressed sequence files. Before storing each comment, it deletes the id attribute and uses the value as the key in the sequence files.

import module "oxh:xmlf";

declare 
   %seq:put("xml")
   %seq:compress("default", "block") 
   %seq:file("comments")
function local:myPut($key as xs:string, $value as node()) external;

for $comment in xmlf:collection("mydata/comments*.xml", "comment")
let $id := $comment/@id
let $newComment := 
   copy $c := $comment 
   modify delete node $c/@id
   return $c
return
   local:myPut($id, $newComment)

Example 6-17

The next query reads the sequence files that the previous query created in an output directory named myoutput. The query automatically decompresses the sequence files.

import module "oxh:text";
import module "oxh:seq";

for $comment in seq:collection-xml("myoutput/comments*")/comment
let $id := $comment/seq:key()
where $id eq "12345"
return 
   text:put-xml($comment)
 

The query creates a text file that contains the following line:

<comment id="12345" user="john" text="It is raining :( "/>

6.6 Solr Adapter

This adapter provides functions to create full-text indexes and load them into Apache Solr servers. These functions call the Solr org.apache.solr.hadoop.MapReduceIndexerTool at run time to generate a full-text index on HDFS and optionally merge it into Solr servers. You can declare and use multiple custom put functions, as well as the built-in put function, within a single query. For example, you can load data into different Solr collections or into different Solr clusters.

This adapter is described in the following topics:

6.6.1 Prerequisites for Using the Solr Adapter

The first time that you use the Solr adapter, ensure that Solr is installed and configured on your Hadoop cluster as described in "Installing Oracle XQuery for Hadoop".

6.6.1.1 Configuration Settings

Your Oracle XQuery for Hadoop query must use the following configuration properties or the equivalent annotation:

  • oracle.hadoop.xquery.solr.loader.zk-host

  • oracle.hadoop.xquery.solr.loader.collection

If the index is loaded into a live set of Solr servers, then this configuration property or the equivalent annotation is also required:

  • oracle.hadoop.xquery.solr.loader.go-live

You can set the configuration properties using either the -D or -conf options in the hadoop command when you run the query. See "Running Queries" and "Solr Adapter Configuration Properties."

6.6.1.2 Example Query Using the Solr Adapter

This example sets OXH_SOLR_MR_HOME and uses the hadoop -D option in a query to set the configuration properties:

$ export OXH_SOLR_MR_HOME=/usr/lib/solr/contrib/mr 
$ hadoop jar $OXH_HOME/lib/oxh.jar \
    -D oracle.hadoop.xquery.solr.loader.zk-host=/solr \
    -D oracle.hadoop.xquery.solr.loader.collection=collection1 \
    -D oracle.hadoop.xquery.solr.loader.go-live=true \
    ./myquery.xq -output ./myoutput

6.6.2 Built-in Functions for Loading Data into Solr Servers

To use the built-in functions in your query, you must import the Solr module as follows:

import module "oxh:solr";

The Solr module contains the following functions:

The solr prefix is bound to the oxh:solr namespace by default.

6.6.2.1 solr:put

Writes a single document to the Solr index.

The XML format of this document is specified by Solr at

https://wiki.apache.org/solr/UpdateXmlMessages

Signature

declare %solr:put function
   solr:put($value as element(doc)) external;

Parameters

$value: A single XML element named doc, which contains one or more field elements, as shown here:

<doc>
<field name="field_name_1">field_value_1</field>
     .
     .
     .
<field name="field_name_N">field_value_N</field>
</doc>

Returns

A generated index that is written into the output_dir/solr-put directory, where output_dir is the query output directory

6.6.3 Custom Functions for Loading Data into Solr Servers

You can use the following annotations to define functions that generate full-text indexes and load them into Solr.

Signature

Custom functions for generating Solr indexes must have the following signature:

declare %solr:put [additional annotations] 
   function local:myFunctionName($value as node()) external;

Annotations

%solr:put

Declares the solr put function. Required.

%solr:file(directory_name)

Name of the subdirectory under the query output directory where the index files will be written. Optional; the default value is the function local name.

%solr-property:property_name(value)

Controls various aspects of index generation. You can specify multiple %solr-property annotations.

These annotations correspond to the command-line options of org.apache.solr.hadoop.MapReduceIndexerTool. Each MapReduceIndexerTool option has an equivalent Oracle XQuery for Hadoop configuration property and a %solr-property annotation. Annotations take precedence over configuration properties. See "Solr Adapter Configuration Properties" for more information about supported configuration properties and the corresponding annotations.

See Also:

For more information about MapReduceIndexerTool command line options, see Cloudera Search User Guide at

http://www.cloudera.com/content/cloudera-content/cloudera-docs/Search/latest/Cloudera-Search-User-Guide/csug_mapreduceindexertool.html

Parameters

$value: An element or a document node conforming to the Solr XML syntax. See "solr:put" for details.

6.6.4 Examples of Solr Adapter Functions

Example 6-18 Using the Built-in solr:put Function

This example uses the following HDFS text file. The file contains user profile information such as user ID, full name, and age, separated by colons (:).

mydata/users.txt
john:John Doe:45 
kelly:Kelly Johnson:32
laura:Laura Smith: 
phil:Phil Johnson:27

The first query creates a full-text index searchable by name.

import module "oxh:text";
import module "oxh:solr";
for $line in text:collection("mydata/users.txt") 
let $split := fn:tokenize($line, ":") 
let $id := $split[1]
let $name := $split[2]
return solr:put(
<doc>
<field name="id">{ $id }</field>
<field name="name">{ $name }</field>
</doc>
)

The second query produces the same result, but it uses a custom put function and defines all configuration parameters with function annotations. Thus, setting configuration properties is not required when running this query.

import module "oxh:text";
declare
   %solr:put
   %solr-property:go-live
   %solr-property:zk-host("/solr")
   %solr-property:collection("collection1")
function local:my-solr-put($doc as element(doc)) external;

for $line in text:collection("mydata/users.txt") 
let $split := fn:tokenize($line, ":") 
let $id := $split[1]
let $name := $split[2]
return local:my-solr-put(
<doc>
<field name="id">{ $id }</field>
<field name="name">{ $name }</field>
</doc>
)

6.6.5 Solr Adapter Configuration Properties

The Solr adapter configuration properties correspond to the Solr MapReduceIndexerTool options.

MapReduceIndexerTool is a MapReduce batch job driver that creates Solr index shards from input files, and writes the indexes into HDFS. It also supports merging the output shards into live Solr servers, typically a SolrCloud.

You can specify these properties with the generic -conf and -D hadoop command-line options in Oracle XQuery for Hadoop. Properties specified using this method apply to all Solr adapter put functions in your query. See "Running Queries" and especially "Generic Options" for more information about the hadoop command-line options.

Alternatively, you can specify these properties as Solr adapter put function annotations with the %solr-property prefix. These annotations are identified in the property descriptions. Annotations apply only to the particular Solr adapter put function that contains them in its declaration.

See Also:

For discussions about how Solr uses the MapReduceIndexerTool options, see the Cloudera Search User Guide at

http://www.cloudera.com/content/cloudera-content/cloudera-docs/Search/latest/Cloudera-Search-User-Guide/csug_mapreduceindexertool.html

Property Overview

oracle.hadoop.xquery.solr.loader.collection

Type: String

Default Value: Not defined

Equivalent Annotation: %solr-property:collection

Description: The SolrCloud collection for merging the index, such as mycollection. Use this property with oracle.hadoop.xquery.solr.loader.go-live and oracle.hadoop.xquery.solr.loader.zk-host. Required as either a property or an annotation.

oracle.hadoop.xquery.solr.loader.fair-scheduler-pool

Type: String

Default Value: Not defined

Equivalent Annotation: %solr-property:fair-scheduler-pool

Description: The name of the fair scheduler pool for submitting jobs. The job runs using fair scheduling instead of the default Hadoop scheduling method. Optional.

oracle.hadoop.xquery.solr.loader.go-live

Type: String values true or false

Default Value: false

Equivalent Annotation: %solr-property:go-live

Description: Set to true to merge the final index into a live Solr cluster. Use this property with oracle.hadoop.xquery.solr.loader.collection and oracle.hadoop.xquery.solr.loader.zk-host. Optional.

oracle.hadoop.xquery.solr.loader.go-live-threads

Type: Integer

Default Value: 1000

Equivalent Annotation: %solr-property:go-live-threads

Description: The maximum number of live merges that can run in parallel. Optional.

oracle.hadoop.xquery.solr.loader.log4j

Type: String

Default Value:

Equivalent Annotation: %solr-property:log4j

Description: The relative or absolute path to the log4j.properties configuration file on the local file system. For example, /path/to/log4j.properties. Optional.

This file is uploaded for each MapReduce task.

oracle.hadoop.xquery.solr.loader.mappers

Type: String

Default Value: -1

Equivalent Annotation: %solr-property:mappers

Description: The maximum number of mapper tasks that Solr uses. A value of -1 enables the use of all map slots available on the cluster.

oracle.hadoop.xquery.solr.loader.max-segments

Type: String

Default Value: 1

Equivalent Annotation: %solr-property:max-segments

Description: The maximum number of segments in the index generated by each reducer.

oracle.hadoop.xquery.solr.loader.reducers

Type: String

Default Value: -1

Equivalent Annotation: %solr-property:reducers

Description: The number of reducers to use:

  • -1: Uses all reduce slots available on the cluster.

  • -2: Uses one reducer for each Solr output shard. This setting disables the MapReduce M-tree merge algorithm, which typically improves scalability.

oracle.hadoop.xquery.solr.loader.zk-host

Type: String

Default Value: Not defined

Equivalent Annotation: %solr-property:zk-host

Description: The address of a ZooKeeper ensemble used by the SolrCloud cluster. Specify the address as a list of comma-separated host:port pairs, each corresponding to a ZooKeeper server. For example, 127.0.0.1:2181,127.0.0.1:2182,127.0.0.1:2183/solr. Optional.

If the address starts with a slash (/), such as /solr, then Oracle XQuery for Hadoop automatically prefixes the address with the ZooKeeper connection string.

This property enables Solr to determine the number of output shards to create and the Solr URLs in which to merge them. Use this property with oracle.hadoop.xquery.solr.loader.collection and oracle.hadoop.xquery.solr.loader.go-live. Required as either a property or an annotation.

6.7 Text File Adapter

The text file adapter provides functions to read and write text files stored in HDFS. It is described in the following topics:

6.7.1 Built-in Functions for Reading and Writing Text Files

To use the built-in functions in your query, you must import the text file module as follows:

import module "oxh:text";

The text file module contains the following functions:

For examples, see "Examples of Text File Adapter Functions."

6.7.1.1 text:collection

Accesses a collection of text files in HDFS. The files can be compressed using a Hadoop-supported compression codec. They are automatically decompressed when read.

The files might be split up and processed in parallel by multiple tasks.

Signature

declare %text:collection("text") function
   text:collection($uris as xs:string*) as xs:string* external;

declare %text:collection("text") function
   text:collection($uris as xs:string*, $delimiter as xs:string?) as xs:string* external;

Parameters

$uris: The text file URIs.

$delimiter: A custom delimiter on which the file is split. The default is the newline character.

Returns

One string value for each file segment identified by the delimiter; for the default delimiter, a string value for each line in each file
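
For example, the following minimal sketch uses the two-argument form to split a file on a custom delimiter. The file mydata/recs.txt and its semicolon-terminated record format are hypothetical:

import module "oxh:text";

(: Each segment between semicolons is returned as one string :)
for $rec in text:collection("mydata/recs.txt", ";")
return
   text:put(fn:normalize-space($rec))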

6.7.1.2 text:collection-xml

Accesses a collection of text files in HDFS. The files can be compressed using a Hadoop-supported compression codec. They are automatically decompressed when read.

The files might be split up and processed in parallel by multiple tasks. Each delimited section of each file is parsed as an XML document and returned by the function. Therefore, each segment must fully contain a single XML document, and any delimiter characters in the XML must be escaped with XML character references. By default, the delimiter is a newline character.

Signature

declare %text:collection("xml") function 
   text:collection-xml($uris as xs:string*) as document-node()* external;

declare %text:collection("xml") function 
   text:collection-xml($uris as xs:string*, $delimiter as xs:string?) as document-node()* external;

Parameters

$uris: The text file URIs.

$delimiter: A custom delimiter on which the file is split. The default is the newline character.

Returns

One XML document for each file segment identified by the delimiter; for the default delimiter, one XML document for each line in each file

6.7.1.3 text:put

Writes a line to a text file in the output directory of the query. The lines are spread across one or more files.

Signature

declare %text:put("text") function
   text:put($value as xs:string) external;

Parameters

$value: The text to write

Returns

empty-sequence()

Notes

The number of files created depends on how the query is distributed among tasks. Each file has a name that starts with part, such as part-m-00000. You specify the output directory when the query executes. See "Running Queries."

6.7.1.4 text:put-xml

Writes XML to a line in a text file. The lines are spread across one or more files in the output directory of the query.

Newline characters in the serialized XML are replaced with character references to ensure that the XML does not span multiple lines. For example, &#xA; replaces the linefeed character (\n).

Signature

declare %text:put("xml") function
   text:put-xml($value as node()) external;

Parameters

$value: The XML to write

Returns

empty-sequence()

Notes

The number of files created depends on how the query is distributed among tasks. Each file has a name that starts with part, such as part-m-00000. You specify the output directory when the query executes. See "Running Queries."
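
For example, a minimal sketch (reusing the comment data shown in Example 6-14) that writes each comment element as one line of XML:

import module "oxh:text";
import module "oxh:xmlf";

(: Each comment element is serialized onto a single output line :)
for $comment in xmlf:collection("mydata/comments*.xml", "comment")
return
   text:put-xml($comment)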

6.7.1.5 text:trace

Writes a line to a text file named trace-* in the output directory of the query. The lines are spread across one or more files.

This function provides you with a quick way to write to an alternate output. For example, you might create a trace file to identify invalid rows within a query, while loading the data into an Oracle database table.

Signature

declare %text:put("text") %text:file("trace") function
   text:trace($value as xs:string) external;

Parameters

$value: The text to write

Returns

empty-sequence()
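
For example, the following sketch reuses the users.txt file from Example 6-18, in which the third field is an age. It writes valid rows to the main output and traces invalid ones:

import module "oxh:text";

for $line in text:collection("mydata/users.txt")
let $split := fn:tokenize($line, ":")
return
   if ($split[3] castable as xs:integer) then
      (: valid row: write the user ID and age :)
      text:put($split[1] || "," || $split[3])
   else
      (: invalid row: record it in the trace-* files :)
      text:trace("invalid row: " || $line)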

6.7.2 Custom Functions for Reading Text Files

You can use the following annotations to define functions that read collections of text files in HDFS. These annotations provide additional functionality that is not available using the built-in functions.

The input files can be compressed with a Hadoop-supported compression codec. They are automatically decompressed when read.

Signature

Custom functions for reading text files must have one of the following signatures:

declare %text:collection("text") [additional annotations]
   function local:myFunctionName($uris as xs:string*, $delimiter as xs:string?) as xs:string* external;
 
declare %text:collection("text") [additional annotations]
   function local:myFunctionName($uris as xs:string*) as xs:string* external;
 
declare %text:collection("xml") [additional annotations]
   function local:myFunctionName($uris as xs:string*, $delimiter as xs:string?) as document-node()* external

declare %text:collection("xml") [additional annotations]
   function local:myFunctionName($uris as xs:string*) as document-node()* external;

Annotations

%text:collection(["method"])

Declares the text collection function. Required.

The optional method parameter can be one of the following values:

  • text: Each line in the text file is returned as xs:string. Default.

  • xml: Each line in the text file is parsed as XML and returned as document-node. Each XML document must be fully contained on a single line. Newline characters inside the document must be represented by a numeric character reference.

%text:split("delimiter")

Specifies a custom delimiter for splitting the input files. The default delimiter is the newline character.

To specify a custom delimiter, use either this annotation or the $delimiter parameter, but do not combine them.

%text:split-max("split-size")

Specifies the maximum split size as either an integer or a string value. The split size controls how the input file is divided into tasks. Hadoop calculates the split size as max($split-min, min($split-max, $block-size)). Optional.

In a string value, you can append K, k, M, m, G, or g to the value to indicate kilobytes, megabytes, or gigabytes instead of bytes (the default unit). These qualifiers are not case sensitive. The following examples are equivalent:

%text:split-max(1024)
%text:split-max("1024")
%text:split-max("1K")

%text:split-min("split-size")

Specifies the minimum split size as either an integer or a string value. The split size controls how the input file is divided into tasks. Hadoop calculates the split size as max($split-min, min($split-max, $block-size)). Optional.

In a string value, you can append K, k, M, m, G, or g to the value to indicate kilobytes, megabytes, or gigabytes instead of bytes (the default unit). These qualifiers are not case sensitive. The following examples are equivalent:

%text:split-min(1024)
%text:split-min("1024")
%text:split-min("1K")

Parameters

$uris as xs:string*

Lists the HDFS file URIs. The files can be uncompressed or compressed with a Hadoop-supported codec. Required.

$delimiter as xs:string?

A custom delimiter on which the input text files are split. The default delimiter is a new line. Do not combine this parameter with the %text:split annotation.

Returns

xs:string* for the text method

document-node()* for the xml method

6.7.3 Custom Functions for Writing Text Files

You can use the following annotations to define functions that write text files in HDFS.

Signature

Custom functions for writing text files must have one of the following signatures:

declare %text:put("text") [additional annotations] function 
   text:myFunctionName($value as xs:string) external;
 
declare %text:put("xml") [additional annotations] function 
   text:myFunctionName($value as node()) external;

Annotations

%text:put(["method"])

Declares the text put function. Required.

The optional method parameter can be one of the following values:

  • text: Writes data to a text file. Default.

  • xml: Writes data to an XML file. The XML is serialized and newline characters are replaced with character references. This process ensures that the resulting XML document is one text line with no line breaks.

%text:compress("codec")

Specifies the compression format used on the output. The default is no compression. Optional.

The codec parameter identifies a compression codec. The first registered compression codec that matches the value is used. The value matches a codec if it equals one of the following:

  1. The fully qualified class name of the codec

  2. The unqualified class name of the codec

  3. The prefix of the unqualified class name before "Codec" (case insensitive)

All of these examples use the default codec and block compression:

%text:compress("org.apache.hadoop.io.compress.DefaultCodec", "block")
%text:compress("DefaultCodec", "block")
%text:compress("default", "block") 
%text:file("name")

Specifies the output file name prefix. The default prefix is part.

%output:parameter

A standard XQuery serialization parameter for the output method (text or XML) specified in %text:put. See "Serialization Annotations."

UTF-8 is currently the only supported character encoding.

6.7.4 Examples of Text File Adapter Functions

Example 6-19 Using Built-in Functions to Query Text Files

This example uses the following text files in HDFS. The files contain a log of visits to different web pages. Each line represents a visit to a web page and contains the time, user name, page visited, and HTTP status code.

mydata/visits1.log  
 
2013-10-28T06:00:00, john, index.html, 200
2013-10-28T08:30:02, kelly, index.html, 200
2013-10-28T08:32:50, kelly, about.html, 200
2013-10-30T10:00:10, mike, index.html, 401

mydata/visits2.log  

2013-10-30T10:00:01, john, index.html, 200
2013-10-30T10:05:20, john, about.html, 200
2013-11-01T08:00:08, laura, index.html, 200
2013-11-04T06:12:51, kelly, index.html, 200
2013-11-04T06:12:40, kelly, contact.html, 200

The following query filters out the pages visited by john and writes only the date and page visited to a new text file:

import module "oxh:text";
 
for $line in text:collection("mydata/visits*.log")
let $split := fn:tokenize($line, "\s*,\s*")
where $split[2] eq "john"
return
   text:put($split[1] || " " || $split[3])
 

This query creates a text file that contains the following lines:

2013-10-28T06:00:00 index.html
2013-10-30T10:00:01 index.html
2013-10-30T10:05:20 about.html

The next query computes the number of page visits per day:

import module "oxh:text";
 
for $line in text:collection("mydata/visits*.log")
let $split := fn:tokenize($line, "\s*,\s*")
let $time := xs:dateTime($split[1])
let $day := xs:date($time)
group by $day
return
   text:put($day || " => " || count($line))
 

The query creates text files that contain the following lines:

2013-10-28 => 3
2013-10-30 => 3
2013-11-01 => 1
2013-11-04 => 2

Example 6-20 Querying Simple Delimited Formats

This example uses the fn:tokenize function to parse the lines of a text file. This technique works well for simple delimited formats.

The following query declares custom put and collection functions. It computes the number of hits and the number of unique users for each page in the logs.

import module "oxh:text";

declare
   %text:collection("text")
   %text:split-max("32m")
function local:col($uris as xs:string*) as xs:string* external;
 
declare
   %text:put("xml")
   %text:compress("gzip") 
   %text:file("pages")
function local:out($arg as node()) external;
 
for $line in local:col("mydata/visits*.log")
let $split := fn:tokenize($line, "\s*,\s*")
let $user := $split[2]
let $page := $split[3]
group by $page
return
   local:out(
      <page>
         <name>{$page}</name>
         <hits>{count($line)}</hits>
         <users>{fn:count(fn:distinct-values($user))}</users>
      </page>
   )

The output directory of the previous query is named myoutput. The following lines are written to myoutput/pages-r-*.gz.

<page><name>about.html</name><hits>2</hits><users>2</users></page>
<page><name>contact.html</name><hits>1</hits><users>1</users></page>
<page><name>index.html</name><hits>6</hits><users>4</users></page>

The files are compressed with the gzip codec. The following query reads the output files, and writes the page name and total hits as plain text. The collection function automatically decompresses the files.

import module "oxh:text";

for $page in text:collection-xml("myoutput/page*.gz")/page
return 
   text:put($page/name || "," || $page/hits)
 

This query creates text files that contain the following lines:

about.html,2
contact.html,1
index.html,6

Example 6-21 Querying Complex Text Formats

The fn:tokenize function might not be adequate for complex formats that contain a variety of data types and delimiters. This example uses the fn:analyze-string function to process a log file in the Apache Common Log format.

A text file named mydata/access.log in HDFS contains the following lines:

192.0.2.0 - - [30/Sep/2013:16:39:38 +0000] "GET /inddex.html HTTP/1.1" 404 284
192.0.2.0 - - [30/Sep/2013:16:40:54 +0000] "GET /index.html HTTP/1.1" 200 12390
192.0.2.4 - - [01/Oct/2013:12:10:54 +0000] "GET /index.html HTTP/1.1" 200 12390
192.0.2.4 - - [01/Oct/2013:12:12:12 +0000] "GET /about.html HTTP/1.1" 200 4567
192.0.2.1 - - [02/Oct/2013:08:39:38 +0000] "GET /indexx.html HTTP/1.1" 404 284
192.0.2.1 - - [02/Oct/2013:08:40:54 +0000] "GET /index.html HTTP/1.1" 200 12390
192.0.2.1 - - [02/Oct/2013:08:42:38 +0000] "GET /aobut.html HTTP/1.1" 404 283
 

The following query computes the requests made after September 2013 when the server returned a status code 404 (Not Found) error. It uses a regular expression and fn:analyze-string to match the components of the log entries. The time format cannot be cast directly to xs:dateTime as it is in Example 6-19. Instead, the ora-fn:dateTime-from-string-with-format function converts the string to an instance of xs:dateTime.

import module "oxh:text";

declare variable $REGEX :=
   '(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]+)" (\S+) (\S+)';

for $line in text:collection("mydata/access.log")
let $match := fn:analyze-string($line, $REGEX)/fn:match
let $time := 
   ora-fn:dateTime-from-string-with-format(
      "dd/MMM/yyyy:HH:mm:ss Z",
      $match/fn:group[4]
   )
let $status := $match/fn:group[6]
where 
   $status eq "404" and 
   $time ge xs:dateTime("2013-10-01T00:00:00")
let $host := $match/fn:group[1]
let $request := $match/fn:group[5]
return
   text:put($host || "," || $request)

The query creates text files that contain the following lines:

192.0.2.1,GET /indexx.html HTTP/1.1
192.0.2.1,GET /aobut.html HTTP/1.1


6.8 Tika File Adapter

This adapter provides functions to parse files stored in HDFS in various formats, using the Apache Tika library. It is described in the following topics:

6.8.1 Built-in Library Functions for Parsing Files with Tika

To use the built-in functions in your query, you must import the Tika file module as follows:

import module "oxh:tika";

The Tika file module contains the following functions:

For examples, see "Examples of Tika File Adapter Functions."

6.8.1.1 tika:collection

Parses files stored in HDFS in various formats and extracts the content or metadata from them.

Signature

declare %tika:collection function
   tika:collection($uris as xs:string*) as document-node()* external;

declare %tika:collection function
   tika:collection($uris as xs:string*, $contentType as xs:string?) as document-node()* external;

Parameters

$uris: The HDFS file URIs.

$contentType: Specifies the media type of the content to parse, and may include the charset attribute. When this parameter is specified, it defines both the content type and the encoding. When it is omitted, Tika attempts to detect the values automatically, for example from the file extension. Oracle recommends that you specify this parameter.

Returns

Returns a document node for each value. See "Tika Parser Output Format".

6.8.1.2 tika:parse

Parses the data given to it as an argument. For example, it can parse an HTML fragment within an XML or JSON document.

Signature

declare function
   tika:parse($data as xs:string?, $contentType as xs:string?) as document-node()* external;

Parameters

$data: The value to be parsed.

$contentType: Specifies the media type of the content to parse, and may include the charset attribute. When this parameter is specified, it defines both the content type and the encoding. When it is omitted, Tika attempts to detect the values automatically, for example from the file extension. Oracle recommends that you specify this parameter.

Returns

Returns a document node for each value. See "Tika Parser Output Format".
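
Example

A minimal sketch that parses an HTML fragment and writes its text content; the fragment and content type shown here are illustrative:

import module "oxh:tika";
import module "oxh:text";

(: Parse the fragment as HTML, then extract the text of the XHTML result :)
let $doc := tika:parse("<p>Hello <b>World</b></p>", "text/html; charset=UTF-8")
return
   text:put(fn:string($doc/*:html))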

6.8.2 Custom Functions for Parsing Files with Tika

You can use the following annotations to define functions to parse files in HDFS with Tika. These annotations provide additional functionality that is not available using the built-in functions.

Signature

Custom functions for reading HDFS files must have one of the following signatures:

declare %tika:collection [additional annotations]
   function local:myFunctionName($uris as xs:string*, $contentType as xs:string?) as document-node()* external;

declare %tika:collection [additional annotations]
   function local:myFunctionName($uris as xs:string*) as document-node()* external;

Annotations

%tika:collection(["method"])

Identifies an external function to be implemented by the Tika file adapter. Required.

The optional method parameter can be one of the following values:

  • tika: Each file is parsed by Tika and returned as document-node(). Default.

%output:media-type

Declares the file content type. It is a MIME type and must not have the charset attribute, as per the XQuery specification. Optional.

%output:encoding

Declares the file character set. Optional.

Note:

The %output:media-type and %output:encoding annotations specify the content type or encoding when the $contentType parameter is not explicitly provided in the signature.

Parameters

$uris as xs:string*

Lists the HDFS file URIs. Required.

$contentType as xs:string?

The file content type. It may have the charset attribute.

Returns

document-node()* with two root elements. See "Tika Parser Output Format".
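
For example, a sketch of a custom function that fixes the content type and encoding through annotations, so that the $contentType parameter can be omitted; the file pattern and the title element lookup are illustrative:

import module "oxh:text";

declare
   %tika:collection
   %output:media-type("text/html")
   %output:encoding("UTF-8")
function local:parseHtml($uris as xs:string*) as document-node()* external;

(: Write the title of each parsed HTML document :)
for $doc in local:parseHtml("mydata/*.html")
return
   text:put(fn:string(($doc//*:title)[1]))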

6.8.3 Tika Parser Output Format

The result of Tika parsing is a document node with two root elements:

  • Root element #1 is the XHTML content produced by Tika.

  • Root element #2 is the document metadata extracted by Tika.

The root elements have the following formats:

Root element #1

<html xmlns="http://www.w3.org/1999/xhtml">
...textual content of Tika HTML...
</html>

Root element #2

<tika:metadata xmlns:tika="oxh:tika">
   <tika:property name="Name_1">VALUE_1</tika:property>
   <tika:property name="NAME_2">VALUE_2</tika:property>
</tika:metadata>
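
A sketch of how a query might navigate both root elements follows. The resourceName property matches its use in Example 6-22; other property names depend on the parsed format:

import module "oxh:tika";
import module "oxh:text";

for $doc in tika:collection("mydata/*.pdf")
(: Root element #2: look up a metadata property by name :)
let $name := fn:string(($doc/*:metadata/*:property[@name eq "resourceName"])[1])
(: Root element #1: take the textual content of the XHTML :)
let $text := fn:string($doc/*:html)
return
   text:put($name || ": " || fn:substring($text, 1, 40))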

6.8.4 Tika Adapter Configuration Properties

The following Hadoop properties control the behavior of Tika adapter:

oracle.hadoop.xquery.tika.html.asis

Type: Boolean

Default Value: false

Description: When this property is set to true, all HTML elements are emitted during parsing. When it is set to false, only the safe elements are emitted.

oracle.hadoop.xquery.tika.locale

Type: Comma-separated list of strings

Default Value: Not defined

Description: Defines the locale to be used by some Tika parsers, such as the Microsoft Office document parser. Only three strings are allowed: language, country, and variant. The country and variant strings are optional. When the locale is not defined, the system locale is used. When the strings are defined, they must correspond to the java.util.Locale specification format described at http://docs.oracle.com/javase/7/docs/api/java/util/Locale.html, and the locale is constructed as follows:

  • If only language is specified, then the locale is constructed from the language.

  • If the language and country are specified, then the locale is constructed from both language and country.

  • If language, country, and variant are specified, then the locale is constructed from language, country, and variant.

6.8.5 Examples of Tika File Adapter Functions

Example 6-22 Using Built-in Functions to Index PDF documents with Cloudera Search

This example query uses Tika to parse PDF files into HTML form and then add the HTML documents into Solr's full-text index.

*bigdata*.pdf
 

The following query indexes the HDFS files:

import module "oxh:tika";
import module "oxh:solr";
 
for $doc in tika:collection("*bigdata*.pdf")
let $docid := data($doc//*:meta[@name eq "resourceName"]/@content)[1]
let $body := $doc//*:body[1]
return
   solr:put(
        <doc> 
            <field name="id">{ $docid }</field>
            <field name="text">{ string($body) }</field>
            <field name="content">{ serialize($doc/*:html) }</field>
         </doc> 
   )
 

The HTML representation of the documents is added to the Solr index, making them searchable. The ID of each document in the index is its file name.

Example 6-23 Using Built-in Functions to Index HTML documents with Cloudera Search

This example query uses Tika to parse HTML documents stored in sequence files, where each key is a URL and each value is an HTML document.

import module "oxh:tika";
import module "oxh:solr";
import module "oxh:seq";

for $doc in seq:collection-tika("/path/to/seq/files/*")
let $docid := document-uri($doc)
let $body := $doc//*:body[1]
return
   solr:put(
      <doc>
         <field name="id">{ $docid }</field>
         <field name="text">{ string($body) }</field>
         <field name="content">{ serialize($doc/*:html) }</field>
      </doc>
   )

The HTML representation of the documents is added to the Solr index, making them searchable. The ID of each document in the index is its document URI, which is the URL key from the sequence file.

6.9 XML File Adapter

The XML file adapter provides access to XML files stored in HDFS. The adapter optionally splits individual XML files so that a single file can be processed in parallel by multiple tasks.

This adapter is described in the following topics:

6.9.1 Built-in Functions for Reading XML Files

To use the built-in functions in your query, you must import the XML file module as follows:

import module "oxh:xmlf";

The XML file module contains the following functions:

See "Examples of XML File Adapter Functions."

6.9.1.1 xmlf:collection (Single Task)

Accesses a collection of XML documents in HDFS. Multiple files can be processed concurrently, but each individual file is parsed by a single task.

This function automatically decompresses files compressed with a Hadoop-supported codec.

Note:

HDFS does not perform well when data is stored in many small files. For large data sets with many small XML documents, use Hadoop sequence files and the Sequence File Adapter.

Signature

declare %xmlf:collection function
   xmlf:collection($uris as xs:string*) as document-node()* external;

Parameters

$uris: The XML file URIs

Returns

One XML document for each file

6.9.1.2 xmlf:collection-multipart (Single Task)

Accesses a collection of XML documents in HDFS. Multiple files can be processed concurrently, but each individual file is parsed by a single task. This function is the same as xmlf:collection except that each file may contain multiple well-formed XML documents concatenated together.

This function automatically decompresses files compressed with a Hadoop-supported codec. For example, a file containing multiple XML documents could be compressed using GZIP and then accessed directly by this function.

Signature

declare %xmlf:collection("multipart")
 function    xmlf:collection($uris as xs:string*) as document-node()* external;

Parameters

$uris

The XML file URIs.

Returns

One or more XML documents for each file.

6.9.1.3 xmlf:collection (Multiple Tasks)

Accesses a collection of XML documents in HDFS. The files might be split and processed by multiple tasks simultaneously, which enables very large XML files to be processed efficiently. The function returns only elements that match a specified name.

This function does not automatically decompress files. It only supports XML files that meet certain requirements. See "Restrictions on Splitting XML Files."

Signature

declare %xmlf:collection function
   xmlf:collection($uris as xs:string*, $names as xs:anyAtomicType+) as element()* external;

Parameters

$uris

The XML file URIs

$names

The names of the elements to be returned by the function. The names can be either strings or QNames. For QNames, the XML parser uses the namespace binding implied by the QName prefix and namespace.

Returns

Each element that matches one of the names specified by the $names argument
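
For example, a sketch that passes a QName so that namespace-qualified elements match; the eg namespace and element name are illustrative:

import module "oxh:xmlf";
import module "oxh:text";

declare namespace eg = "http://example.org";

(: The eg prefix in the QName resolves against the declaration above :)
for $foo in xmlf:collection("mydata/*.xml", xs:QName("eg:foo"))
return
   text:put(fn:name($foo))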

6.9.2 Custom Functions for Reading XML Files

You can use the following annotations to define functions that read collections of XML files in HDFS. These annotations provide additional functionality that is not available using the built-in functions.

Signature

Custom functions for reading XML files must have one of the following signatures:

declare %xmlf:collection(["xml"|"multipart"]) [additional annotations]
    function local:myFunctionName($uris as xs:string*) as node()* external;

declare %xmlf:collection("xml") [additional annotations]
    function local:myFunctionName($uris as xs:string*, $names as xs:anyAtomicType+) as element()* external;

Annotations

%xmlf:collection

Declares the collection function. Required.

The method parameter is one of the following values:

  • xml: Each value is parsed as XML

  • multipart: Each value (that is, each file) may contain a concatenation of multiple well-formed XML documents. This method cannot be used with parallel XML parsing. (See %xmlf:split and the two-argument function signature.)

%xmlf:split("element-name1"[,... "element-nameN")

Specifies the element names used for parallel XML parsing. You can use this annotation instead of the $names argument.

When this annotation is specified, only the single-argument version of the function is allowed. This restriction enables the element names to be specified statically, so they do not need to be specified when the function is called.

%output:encoding("charset")

Identifies the text encoding of the input documents.

When this annotation is used with the %xmlf:split annotation or the $names argument, only ISO-8859-1, US-ASCII, and UTF-8 are valid encodings. Otherwise, the valid encodings are those supported by the JVM. UTF-8 is assumed when this annotation is omitted.

See Also:

"Supported Encodings" in the Oracle Java SE documentation at

http://docs.oracle.com/javase/7/docs/technotes/guides/intl/encoding.doc.html

%xmlf:split-namespace("prefix", "namespace")

This annotation provides extra namespace declarations to the parser. You can specify it multiple times to declare one or more namespaces.

Use this annotation to declare the namespaces of ancestor elements. When XML is processed in parallel, only elements that match the specified names are processed by an XML parser. If a matching element depends on the namespace declaration of one of its ancestor elements, then the declaration is not visible to the parser and an error may occur.

These namespace declarations can also be used in element names when specifying the split names. For example:

declare 
    %xmlf:collection 
    %xmlf:split("eg:foo") 
    %xmlf:split-namespace("eg", "http://example.org")
    function local:myFunction($uris as xs:string*) as document-node() external;

%xmlf:split-entity("entity-name", "entity-value")

Provides entity definitions to the XML parser. When XML is processed in parallel, only elements that match the specified split names are processed by an XML parser. The DTD of an input document that is split and processed in parallel is not processed.

In this example, the XML parser expands &foo; entity references as "Hello World":

%xmlf:split-entity("foo","Hello World")
%xmlf:split-max("split-size")

Specifies the maximum split size as either an integer or a string value. The split size controls how the input file is divided into tasks. Hadoop calculates the split size as max($split-min, min($split-max, $block-size)). Optional.

In a string value, you can append K, k, M, m, G, or g to the value to indicate kilobytes, megabytes, or gigabytes instead of bytes (the default unit). These qualifiers are not case sensitive. The following examples are equivalent:

%xmlf:split-max(1024)
%xmlf:split-max("1024")
%xmlf:split-max("1K")

%xmlf:split-min("split-size")

Specifies the minimum split size as either an integer or a string value. The split size controls how the input file is divided into tasks. Hadoop calculates the split size as max($split-min, min($split-max, $block-size)). Optional.

In a string value, you can append K, k, M, m, G, or g to the value to indicate kilobytes, megabytes, or gigabytes instead of bytes (the default unit). These qualifiers are not case sensitive. The following examples are equivalent:

%xmlf:split-min(1024)
%xmlf:split-min("1024")
%xmlf:split-min("1K")

Notes

Restrictions on Splitting XML Files

Individual XML documents can be processed in parallel when the element names are specified using either the $names argument or the %xmlf:split annotation.

The input documents must meet the following constraints to be processed in parallel:

  • XML cannot contain a comment, CDATA section, or processing instruction that contains text that matches one of the specified element names (that is, a < character followed by a name that expands to a QName). Otherwise, such content might be parsed incorrectly as an element.

  • An element in the file that matches a specified element name cannot contain a descendant element that also matches a specified name. Otherwise, multiple processors might pick up the matching descendant and cause the function to produce incorrect results.

  • An element that matches one of the specified element names (and all of its descendants) must not depend on the namespace declarations of any of its ancestors. Because the ancestors of a matching element are not parsed, the namespace declarations in these elements are not processed.

    You can work around this limitation by manually specifying the namespace declarations with the %xmlf:split-namespace annotation.

Oracle recommends that the specified element names do not match elements in the file that are bigger than the split size. If they do, then the adapter functions correctly but not efficiently.

Processing XML in parallel is difficult, because parsing cannot begin in the middle of an XML file. XML constructs like CDATA sections, comments, and namespace declarations impose this limitation. A parser starting in the middle of an XML document cannot assume that, for example, the string <foo> is a begin element tag, without searching backward to the beginning of the document to ensure that it is not in a CDATA section or a comment. However, large XML documents typically contain sequences of similarly structured elements and thus are amenable to parallel processing. If you specify the element names, then each task works by scanning a portion of the document for elements that match one of the specified names. Only elements that match a specified name are given to a true XML parser. Thus, the parallel processor does not perform a true parse of the entire document.

6.9.3 Examples of XML File Adapter Functions

Example 6-24 Using Built-in Functions to Query XML Files

This example queries three XML files in HDFS with the following contents. Each XML file contains comments made by users on a specific day. Each comment can have zero or more "likes" from other users.

mydata/comments1.xml
 
<comments date="2013-12-30">
   <comment id="12345" user="john" text="It is raining :( "/>
   <comment id="56789" user="kelly" text="I won the lottery!">
      <like user="john"/>
      <like user="mike"/>
   </comment>
</comments>
 
mydata/comments2.xml
 
<comments date="2013-12-31">
   <comment id="54321" user="mike" text="Happy New Year!">
      <like user="laura"/>
   </comment>
</comments>
 
mydata/comments3.xml
  
<comments date="2014-01-01">
   <comment id="87654" user="mike" text="I don't feel so good."/>
   <comment id="23456" user="john" text="What a beautiful day!">
      <like user="kelly"/>
      <like user="phil"/>
   </comment>
</comments>
 

This query writes the number of comments made each year to a text file. No element names are passed to xmlf:collection, and so it returns three documents, one for each file. Each file is processed serially by a single task.

import module "oxh:xmlf";
import module "oxh:text";

for $comments in xmlf:collection("mydata/comments*.xml")/comments
let $date := xs:date($comments/@date)
group by $year := fn:year-from-date($date)
return 
   text:put($year || ", " || fn:count($comments/comment))

The query creates text files that contain the following lines:

2013, 3
2014, 2

The next query writes the number of comments and the average number of likes for each user. Each input file is split, so that it can be processed in parallel by multiple tasks. The xmlf:collection function returns five elements, one for each comment.

import module "oxh:xmlf";
import module "oxh:text";

for $comment in xmlf:collection("mydata/comments*.xml", "comment")
let $likeCt := fn:count($comment/like)
group by $user := $comment/@user
return 
   text:put($user || ", " || fn:count($comment) || ", " || fn:avg($likeCt))
 

This query creates text files that contain the following lines:

john, 2, 1
kelly, 1, 2
mike, 2, 0.5

Example 6-25 Writing a Custom Function to Query XML Files

The following example declares a custom function to access XML files:

import module "oxh:text";
 
declare 
   %xmlf:collection 
   %xmlf:split("comment")
   %xmlf:split-max("32M")
function local:comments($uris as xs:string*) as element()* external;
 
for $c in local:comments("mydata/comment*.xml")
where $c/@user eq "mike"
return text:put($c/@id)
 

The query creates a text file that contains the following lines:

54321
87654

Example 6-26 Accessing Compressed, Multipart XML Files

Assume that files comments1.xml, comments2.xml, and comments3.xml from Example 6-24 are concatenated together and compressed using GZIP to create a single file named comments.xml.gz. For example:

cat comments1.xml comments2.xml comments3.xml | gzip > comments.xml.gz

The following query accesses this multipart, compressed XML file:

import module "oxh:text"; import module "oxh:xmlf";  
for $comment in xmlf:collection-multipart("comments.xml.gz")/comments/comment 
return
   text:put($comment/@id || "," || $comment/@user)

The query creates a text file that contains the following lines:

12345,john 
56789,kelly 
54321,mike 
87654,mike 
23456,john

6.10 Utility Module

The utility module contains ora-fn functions for handling strings and dates. These functions are defined in XDK XQuery, whereas the oxh functions are specific to Oracle XQuery for Hadoop.

The utility functions are described in the following topics:

6.10.1 Oracle XQuery Functions for Duration, Date, and Time

You can manipulate durations, dates, and times in XQuery using Oracle XQuery functions.

The Oracle XQuery functions are in namespace http://xmlns.oracle.com/xdk/xquery/function. The namespace prefix ora-fn is predeclared, and the module is automatically imported.

6.10.1.1 ora-fn:date-from-string-with-format

This Oracle XQuery function returns a new date value from a string according to a given pattern.

Signature

ora-fn:date-from-string-with-format($format as xs:string?,
                                    $dateString as xs:string?, 
                                    $locale as xs:string?)
                                   as xs:date?

ora-fn:date-from-string-with-format($format as xs:string?,
                                    $dateString as xs:string?)
                                   as xs:date?

Parameters

$format: The pattern; see Format Argument

$dateString: An input string that represents a date

$locale: A one- to three-field value that represents the locale; see Locale Argument

Example

This example returns the specified date in the current time zone:

ora-fn:date-from-string-with-format("yyyy-MM-dd G", "2013-06-22 AD")

6.10.1.2 ora-fn:date-to-string-with-format

This Oracle XQuery function returns a date string with a given pattern.

Signature

ora-fn:date-to-string-with-format($format as xs:string?,
                                  $date as xs:date?,
                                  $locale as xs:string?)
                                 as xs:string?

ora-fn:date-to-string-with-format($format as xs:string?,
                                  $date as xs:date?)
                                 as xs:string?

Parameters

$format: The pattern; see Format Argument

$date: The date

$locale: A one- to three-field value that represents the locale; see Locale Argument

Example

This example returns the string 2013-07-15:

ora-fn:date-to-string-with-format("yyyy-mm-dd", xs:date("2013-07-15"))

6.10.1.3 ora-fn:dateTime-from-string-with-format

This Oracle XQuery function returns a new date-time value from an input string, according to a given pattern.

Signature

ora-fn:dateTime-from-string-with-format($format as xs:string?, 
                                        $dateTimeString as xs:string?,
                                        $locale as xs:string?)
                                       as xs:dateTime?

ora-fn:dateTime-from-string-with-format($format as xs:string?,
                                        $dateTimeString as xs:string?)
                                       as xs:dateTime?

Parameters

$format: The pattern; see Format Argument

$dateTimeString: The date and time

$locale: A one- to three-field value that represents the locale; see Locale Argument

Examples

This example returns the specified date and 11:04:00AM in the current time zone:

ora-fn:dateTime-from-string-with-format("yyyy-MM-dd 'at' hh:mm", 
                                        "2013-06-22 at 11:04")

The next example returns the specified date and 12:00:00AM in the current time zone:

ora-fn:dateTime-from-string-with-format("yyyy-MM-dd G",
                                        "2013-06-22 AD") 

6.10.1.4 ora-fn:dateTime-to-string-with-format

This Oracle XQuery function returns a date and time string with a given pattern.

Signature

ora-fn:dateTime-to-string-with-format($format as xs:string?,
                                      $dateTime as xs:dateTime?,
                                      $locale as xs:string?)
                                     as xs:string?

ora-fn:dateTime-to-string-with-format($format as xs:string?,
                                      $dateTime as xs:dateTime?)
                                     as xs:string?

Parameters

$format: The pattern; see Format Argument

$dateTime: The date and time

$locale: A one- to three-field value that represents the locale; see Locale Argument

Examples

This example returns the string 07 JAN 2013 10:09 PM AD:

ora-fn:dateTime-to-string-with-format("dd MMM yyyy hh:mm a G",
                                      xs:dateTime("2013-01-07T22:09:44"))

The next example returns the string "01-07-2013":

ora-fn:dateTime-to-string-with-format("MM-dd-yyyy",
                                      xs:dateTime("2013-01-07T22:09:44")) 

6.10.1.5 ora-fn:time-from-string-with-format

This Oracle XQuery function returns a new time value from an input string, according to a given pattern.

Signature

ora-fn:time-from-string-with-format($format as xs:string?,
                                    $timeString as xs:string?,
                                    $locale as xs:string?)
                                   as xs:time?

ora-fn:time-from-string-with-format($format as xs:string?,
                                    $timeString as xs:string?)
                                   as xs:time?

Parameters

$format: The pattern; see Format Argument

$timeString: The time

$locale: A one- to three-field value that represents the locale; see Locale Argument

Example

This example returns 9:45:22 PM in the current time zone:

ora-fn:time-from-string-with-format("HH.mm.ss", "21.45.22")

The next example returns 8:07:22 PM in the current time zone:

ora-fn:time-from-string-with-format("hh:mm:ss a", "8:07:22 PM") 

6.10.1.6 ora-fn:time-to-string-with-format

This Oracle XQuery function returns a time string with a given pattern.

Signature

ora-fn:time-to-string-with-format($format as xs:string?,
                                  $time as xs:time?,
                                  $locale as xs:string?)
                                 as xs:string?

ora-fn:time-to-string-with-format($format as xs:string?, $time as xs:time?) as xs:string?

Parameters

$format: The pattern; see Format Argument

$time: The time

$locale: A one- to three-field value that represents the locale; see Locale Argument

Examples

This example returns the string "10:09 PM":

ora-fn:time-to-string-with-format("hh:mm a", xs:time("22:09:44"))

The next example returns the string "22:09 PM":

ora-fn:time-to-string-with-format("HH:mm a", xs:time("22:09:44"))

6.10.1.7 Format Argument

The $format argument identifies the various fields that compose a date or time value.
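
The pattern letters used throughout this chapter (yyyy, MM, dd, HH, mm, ss, a, G) are consistent with java.text.SimpleDateFormat. Assuming that syntax, this call returns the string "2013-01-07 22:09:44":

ora-fn:dateTime-to-string-with-format("yyyy-MM-dd HH:mm:ss",
                                      xs:dateTime("2013-01-07T22:09:44"))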

6.10.1.8 Locale Argument

The $locale represents a specific geographic, political, or cultural region.

It is defined by up to three fields:

  1. Language code: The ISO 639 alpha-2 or alpha-3 language code, or the registered language subtags of up to eight letters. For example, en for English and ja for Japanese.

  2. Country code: The ISO 3166 alpha-2 country code or the UN M.49 numeric-3 area code. For example, US for the United States and 029 for the Caribbean.

  3. Variant: Indicates a variation of the locale, such as a particular dialect. Order multiple values in order of importance and separate them with an underscore (_). These values are case sensitive.
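
For example, assuming the underscore-separated fields described above, this call returns the French name of the weekday, "lundi"; the pattern and locale value are illustrative:

ora-fn:date-to-string-with-format("EEEE", xs:date("2013-07-15"), "fr_FR")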


6.10.2 Oracle XQuery Functions for Strings

You can manipulate strings in XQuery using Oracle XQuery functions.

The Oracle XQuery functions are in namespace http://xmlns.oracle.com/xdk/xquery/function. The namespace prefix ora-fn is predeclared, and the module is automatically imported.

6.10.2.1 ora-fn:pad-left

Adds padding characters to the left of a string to create a fixed-length string. If the input string exceeds the specified size, then it is truncated to return a substring of the specified length. The default padding character is a space (ASCII 32).

Signature

ora-fn:pad-left($str as xs:string?,
                $size as xs:integer?,
                $pad as xs:string?)
               as xs:string?

ora-fn:pad-left($str as xs:string?,
                $size as xs:integer?)
               as xs:string?

Parameters

$str: The input string

$size: The desired fixed length, which is obtained by adding padding characters to $str

$pad: The padding character

If either argument is an empty sequence, then the function returns an empty sequence.

Examples

This example prefixes "01" to the input string up to the maximum of six characters. The returned string is "010abc". The function returns one complete and one partial pad character.

ora-fn:pad-left("abc", 6, "01")

This example returns only "ab" because the input string exceeds the specified fixed length:

ora-fn:pad-left("abcd", 2, "01")

This example prefixes spaces to the string up to the specified maximum of six characters. The returned string has a prefix of two spaces: "  abcd":

ora-fn:pad-left("abcd", 6)

The next example returns only "ab" because the input string exceeds the specified fixed length:

ora-fn:pad-left("abcd", 2)

6.10.2.2 ora-fn:pad-right

Adds padding characters to the right of a string to create a fixed-length string. If the input string exceeds the specified size, then it is truncated to return a substring of the specified length. The default padding character is a space (ASCII 32).

Signature

ora-fn:pad-right($str as xs:string?,
                 $size as xs:integer?,
                 $pad as xs:string?)
                as xs:string?

ora-fn:pad-right($str as xs:string?,
                 $size as xs:integer?)
                as xs:string?

Parameters

$str: The input string

$size: The desired fixed length, which is obtained by adding padding characters to $str

$pad: The padding character

If either argument is an empty sequence, then the function returns an empty sequence.

Examples

This example appends "01" to the input string up to the maximum of six characters. The returned string is "abc010". The function returns one complete and one partial pad character.

ora-fn:pad-right("abc", 6, "01")

This example returns only "ab" because the input string exceeds the specified fixed length:

ora-fn:pad-right("abcd", 2, "01")

This example appends spaces to the string up to the specified maximum of six characters. The returned string has a suffix of two spaces: "abcd  ":

ora-fn:pad-right("abcd", 6)

The next example returns only "ab" because the input string exceeds the specified fixed length:

ora-fn:pad-right("abcd", 2)

6.10.2.3 ora-fn:trim

Removes any leading or trailing white space from a string.

Signature

ora-fn:trim($input as xs:string?) as xs:string?

Parameters

$input: The string to trim. If $input is an empty sequence, then the function returns an empty sequence. Other data types trigger an error.

Example

This example returns the string "abc":

ora-fn:trim("  abc  ")

6.10.2.4 ora-fn:trim-left

Removes any leading white space.

Signature

ora-fn:trim-left($input as xs:string?) as xs:string?

Parameters

$input: The string to trim. If $input is an empty sequence, then the function returns an empty sequence. Other data types trigger an error.

Example

This example removes the leading spaces and returns the string "abc    ":

ora-fn:trim-left("    abc    ")

6.10.2.5 ora-fn:trim-right

Removes any trailing white space.

Signature

ora-fn:trim-right($input as xs:string?) as xs:string?

Parameters

$input: The string to trim. If $input is an empty sequence, then the function returns an empty sequence. Other data types trigger an error.

Example

This example removes the trailing spaces and returns the string "    abc":

ora-fn:trim-right("    abc    ")

6.11 Hadoop Module

These functions are in the http://xmlns.oracle.com/hadoop/xquery namespace. The oxh prefix is predeclared and the module is automatically imported.

The Hadoop module is described in the following topic:

  • Hadoop Functions

6.11.1 Built-in Functions for Using Hadoop

The following functions are built in to Oracle XQuery for Hadoop:

6.11.1.1 oxh:find

Returns a sequence of file paths that match a pattern.

Signature

oxh:find($pattern as xs:string?) as xs:string*

Parameters

$pattern: The file pattern to search for

See Also:

For the file pattern, the globStatus method in the Apache Hadoop API at

http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html#globStatus(org.apache.hadoop.fs.Path)
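
Example

This sketch prints the paths that match a pattern (reusing the log files from Example 6-19) to stdout during query development:

for $path in oxh:find("mydata/visits*.log")
return oxh:println($path)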

6.11.1.2 oxh:increment-counter

Increments a user-defined MapReduce job counter. The default increment is one (1).

Signature

oxh:increment-counter($groupName as xs:string, $counterName as xs:string, $value as xs:integer)

oxh:increment-counter($groupName as xs:string, $counterName as xs:string)

Parameters

$groupName: The group of counters that this counter belongs to.

$counterName: The name of a user-defined counter

$value: The amount to increment the counter
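
Example

A sketch that counts non-200 status entries in the log files from Example 6-19 while writing the other lines; the counter group and name are illustrative:

import module "oxh:text";

for $line in text:collection("mydata/visits*.log")
let $split := fn:tokenize($line, "\s*,\s*")
return
   if ($split[4] ne "200") then
      (: increments the counter by the default value of 1 :)
      oxh:increment-counter("visits", "errors")
   else
      text:put($line)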

6.11.1.3 oxh:println

Prints a line of text to stdout of the Oracle XQuery for Hadoop client process. Use this function when developing queries.

Signature

declare %updating function oxh:println($arg as xs:anyAtomicType?)

Parameters

$arg: A value to add to the output. A cast operation first converts it to string. An empty sequence is handled the same way as an empty string.

Example

This example prints the values of data.txt to stdout:

for $i in text:collection("data.txt") 
return oxh:println($i)

6.11.1.4 oxh:println-xml

Prints a line of text or XML to stdout of the Oracle XQuery for Hadoop client process. Use this function when developing queries and printing nodes of an XML document.

Signature

declare %updating function oxh:println-xml($arg as item()?)

Parameters

$arg: A value to add to the output. The input item is converted into text as defined by the XSLT 2.0 and XQuery 1.0 serialization specifications. An empty sequence is handled the same way as an empty string.
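
Example

This sketch prints the first comment element of each file from Example 6-24 to stdout:

import module "oxh:xmlf";

for $doc in xmlf:collection("mydata/comments*.xml")
return oxh:println-xml(($doc/comments/comment)[1])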

6.11.1.5 oxh:property

Returns the value of a Hadoop configuration property.

Signature

oxh:property($name as xs:string?) as xs:string?

Parameters

$name: The configuration property
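
Example

A minimal sketch that prints a standard Hadoop property during query development; the property name is illustrative:

oxh:println("default file system: " || oxh:property("fs.defaultFS"))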

6.12 Serialization Annotations

Several adapters have serialization annotations (%output:*). The following lists identify the serialization parameters that Oracle XQuery for Hadoop supports.

Serialization parameters supported for the text output method:

  • encoding: Any encoding supported by the JVM

  • normalization-form: none, NFC, NFD, NFKC, NFKD

Serialization parameters supported for the xml output method, using any values permitted by the XQuery specification:

  • cdata-section-elements

  • doctype-public

  • doctype-system

  • encoding

  • indent

  • normalization-form

  • omit-xml-declaration

  • standalone
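
For example, a sketch of a custom put function that applies two of these parameters; local:out is a hypothetical function name:

import module "oxh:text";

declare
   %text:put("xml")
   %output:omit-xml-declaration("yes")
   %output:encoding("UTF-8")
function local:out($value as node()) external;

(: Each element is serialized onto one line without an XML declaration :)
for $i in (1, 2, 3)
return local:out(<n>{ $i }</n>)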

See Also:

"The Influence of Serialization Parameters" sections for XML and text output methods in XSLT and XQuery Serialization, at locations like the following:

http://www.w3.org/TR/xslt-xquery-serialization/#XML_DOCTYPE

http://www.w3.org/TR/xslt-xquery-serialization/#XML_CDATA-SECTION-ELEMENTS