This chapter describes the adapters available in Oracle XQuery for Hadoop:
This chapter also describes several other library modules:
The Avro file adapter provides functions to read and write Avro container files in HDFS. It is described in the following topics:
Note:
Additional Configuration Steps for HDP Users
Oracle XQuery for Hadoop has been verified to run on both Cloudera's CDH5 and Hortonworks HDP 2.3.3. However, to run queries that write to Avro container files in HDP 2.3.2, you must change the OXH classpath definition to use avro-mapred-1.7.4-hadoop2.jar. Download the JAR from the Apache archive: https://archive.apache.org/dist/avro/avro-1.7.4/java/
In $OXH_HOME/lib/oxh-lib.xml locate the following path tag:
<path id="oracle.hadoop.xquery.avro.lib">
   <fileset dir="${oracle.hadoop.loader.olh_home}/jlib" erroronmissingdir="false">
      <include name="avro-mapred*.jar"/>
   </fileset>
</path>
Replace the path tag above with the following revision. [DIRECTORY] in this example is a placeholder. Replace it with the directory path to the JAR.
<path id="oracle.hadoop.xquery.avro.lib">
   <fileset dir="[DIRECTORY]">
      <include name="avro-mapred-1.7.4-hadoop2.jar"/>
   </fileset>
</path>
To use the built-in functions in your query, you must import the Avro file module as follows:
import module "oxh:avro";
The Avro file module contains the following functions:
There are no built-in functions for writing Avro container files. To write Avro files, you must use a custom function that specifies the Avro writer schema.
Accesses a collection of Avro files in HDFS. The files might be split up and processed in parallel by multiple tasks. The function returns an XML element for each object. See "About Converting Values Between Avro and XML."
Signature
declare %avro:collection("avroxml")
function avro:collection-avroxml($uris as xs:string*) as element()* external;
Parameters
$uris
: The Avro file URIs
Returns
One XML element for each Avro object.
Retrieves an entry from an Avro map modeled as XML.
If you omit the $map parameter, then the behavior is identical to calling the two-argument function and using the context item for $map.
Signature
avro:get($key as xs:string?, $map as node()?) as element(oxh:entry)?
avro:get($key as xs:string?) as element(oxh:entry)?
Returns
The value of this XPath expression:
$map/oxh:entry[@key eq $key]
Example
These function calls are equivalent:
$var/avro:get("key")
avro:get("key", $var)
$var/oxh:entry[@key eq "key"]
In this example, $var is an Avro map modeled as XML. See "Reading Maps."
You can use the following annotations to define functions that read collections of Avro container files in HDFS. These annotations provide additional functionality that is not available using the built-in functions.
Signature
Custom functions for reading Avro files must have the following signature:
declare %avro:collection("avroxml") [additional annotations]
function local:myFunctionName($uris as xs:string*) as element()* external;
Annotations
%avro:collection("avroxml")
Declares the avroxml collection function. Required.
A collection function accesses Avro files in HDFS. The files might be split up and processed in parallel by multiple tasks. The function returns an XML element for each object. See "About Converting Values Between Avro and XML."
%avro:schema
Provides the Avro reader schema as the value of the annotation. Optional.
The objects in the file are mapped to the reader schema when it is specified. For example:
%avro:schema('
   {
      "type": "record",
      "name": "Person",
      "fields" : [
         {"name": "full_name", "type": "string"},
         {"name": "age", "type": ["int", "null"] }
      ]
   }
')
You cannot combine this annotation with %avro:schema-file or %avro:schema-kv.
See Also:
"Schema Resolution" in the Apache Avro Specification at
http://avro.apache.org/docs/current/spec.html#Schema+Resolution
%avro:schema-file
Like %avro:schema, but the annotation value is a file URI that contains the Avro reader schema. Relative URIs are resolved against the current working directory of the client's local file system. Optional.
For example, %avro:schema-file("schemas/person.avsc").
You cannot combine this annotation with %avro:schema or %avro:schema-kv.
%avro:schema-kv
Like %avro:schema, but the annotation value is a fully qualified record name. The record schema is retrieved from the Oracle NoSQL Database catalog. Optional.
For example, %avro:schema-kv("org.example.PersonRecord").
You must specify the connection parameters to Oracle NoSQL Database when you use this annotation. See "Oracle NoSQL Database Adapter Configuration Properties."
You cannot combine this annotation with %avro:schema or %avro:schema-file.
%avro:split-max
Specifies the maximum split size as either an integer or a string value. The split size controls how the input file is divided into tasks. Hadoop calculates the split size as max($split-min, min($split-max, $block-size)). Optional.
In a string value, you can append K, k, M, m, G, or g to the value to indicate kilobytes, megabytes, or gigabytes instead of bytes (the default unit). These qualifiers are not case sensitive. The following examples are equivalent:
%avro:split-max(1024)
%avro:split-max("1024")
%avro:split-max("1K")
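The split-size formula and the unit suffixes above can be sketched in a few lines of Python (an illustration only, not OXH source code):

```python
# Sketch of how Hadoop derives the split size from %avro:split-min,
# %avro:split-max, and the HDFS block size (illustration, not OXH code).
UNITS = {"k": 1024, "m": 1024**2, "g": 1024**3}

def parse_size(value):
    """Accept an int, or a string with an optional K/M/G suffix (case-insensitive)."""
    if isinstance(value, int):
        return value
    s = value.strip()
    if s[-1].lower() in UNITS:
        return int(s[:-1]) * UNITS[s[-1].lower()]
    return int(s)

def split_size(split_min, split_max, block_size):
    # Hadoop's formula: max($split-min, min($split-max, $block-size))
    return max(parse_size(split_min),
               min(parse_size(split_max), parse_size(block_size)))
```

For example, with a 128 MB block size, %avro:split-max("1K") caps each split at 1024 bytes, matching the equivalent forms shown above.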
%avro:split-min
Specifies the minimum split size as either an integer or a string value. The split size controls how the input file is divided into tasks. Hadoop calculates the split size as max($split-min, min($split-max, $block-size)). Optional.
In a string value, you can append K, k, M, m, G, or g to the value to indicate kilobytes, megabytes, or gigabytes instead of bytes (the default unit). These qualifiers are not case sensitive. The following examples are equivalent:
%avro:split-min(1024)
%avro:split-min("1024")
%avro:split-min("1K")
You can use the following annotations to define functions that write Avro files.
Signature
Custom functions for writing Avro files must have the following signature:
declare %avro:put("avroxml") [additional annotations]
function local:myFunctionName($value as item()) external;
Annotations
%avro:put("avroxml")
Declares the avroxml put function. Required.
An Avro schema must be specified using one of the following annotations:
%avro:schema
%avro:schema-file
%avro:schema-kv
The input XML value is converted to an instance of the schema. See "Writing XML as Avro."
%avro:schema
Specifies the schema of the files. For example:
%avro:schema('
   {
      "type": "record",
      "name": "Person",
      "fields" : [
         {"name": "full_name", "type": "string"},
         {"name": "age", "type": ["int", "null"] }
      ]
   }
')
You cannot combine this annotation with %avro:schema-file or %avro:schema-kv.
%avro:schema-file
Like %avro:schema, but the annotation value is a file URI that contains the Avro reader schema. Relative URIs are resolved against the current working directory of the client's local file system.
For example: %avro:schema-file("schemas/person.avsc")
You cannot combine this annotation with %avro:schema or %avro:schema-kv.
%avro:schema-kv
Like %avro:schema, but the annotation value is a fully qualified record name. The record schema is retrieved from the Oracle NoSQL Database catalog.
For example: %avro:schema-kv("org.example.PersonRecord")
You must specify the connection parameters to Oracle NoSQL Database when you use this annotation. See "Oracle NoSQL Database Adapter Configuration Properties."
You cannot combine this annotation with %avro:schema or %avro:schema-file.
%avro:compress
Specifies the compression format used on the output.
The codec is one of the following string literal values:
deflate: The level controls the trade-off between speed and compression. Valid values are 1 to 9, where 1 is the fastest and 9 is the most compressed.
snappy: This algorithm is designed for high speed and moderate compression.
The default is no compression.
The level is an integer value. It is optional and only supported when codec is deflate.
For example:
%avro:compress("snappy")
%avro:compress("deflate")
%avro:compress("deflate", 3)
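The deflate level trade-off can be observed directly with Python's zlib, which implements the same DEFLATE algorithm that Avro's deflate codec uses (an illustration only, not OXH code):

```python
# Compare deflate level 1 (fastest) with level 9 (most compressed).
import zlib

data = b"some repetitive avro-like payload " * 1000

fast = zlib.compress(data, 1)   # level 1: fastest, least compression
small = zlib.compress(data, 9)  # level 9: slowest, most compression

# Higher levels never produce larger output for repetitive data like this.
assert len(small) <= len(fast) < len(data)
```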
Specifies the output file name prefix. The default prefix is part.
These examples use the following text file in HDFS:
mydata/ages.txt
john,45
kelly,36
laura,
mike,27
Example 6-1 Converting a Text File to Avro
The following query converts the file into compressed Avro container files:
import module "oxh:text"; declare %avro:put("avroxml") %avro:compress("snappy") %avro:schema(' { "type": "record", "name": "AgeRec", "fields" : [ {"name": "user", "type": "string"}, {"name": "age", "type": ["int", "null"] } ] } ') function local:put($arg as item()) external; for $line in text:collection("mydata/ages.txt") let $split := fn:tokenize($line, ",") return local:put( <rec> <user>{$split[1]}</user> { if ($split[2] castable as xs:int) then <age>{$split[2]}</age> else () } </rec> )
The query generates an Avro file with the following records, represented here as JSON:
{"user":"john","age":{"int":45}} {"user":"kelly","age":{"int":36}} {"user":"laura","age":null} {"user":"mike","age":{"int":27}}
Example 6-2 Querying Records in Avro Container Files
The next query selects records in which the age is either null or greater than 30 from the myoutput directory. The records were generated by the query in Example 6-1.
import module "oxh:text"; import module "oxh:avro"; for $rec in avro:collection-avroxml("myoutput/part*.avro") where $rec/age/nilled() or $rec/age gt 30 return text:put($rec/user)
This query creates files that contain the following lines:
john
kelly
laura
This section describes how Oracle XQuery for Hadoop converts data between Avro and XML:
Both the Avro file adapter and the Oracle NoSQL Database adapter have an avroxml method, which you can use with the collection functions to read Avro records as XML. After the Avro is converted to XML, you can query and transform the data using XQuery.
The following topics describe how Oracle XQuery for Hadoop reads Avro:
An Avro record is converted to an <oxh:item> element with one child element for each field in the record.
For example, consider the following Avro schema:
{ "type": "record", "name": "Person", "fields" : [ {"name": "full_name", "type": "string"}, {"name": "age", "type": ["int", "null"] } ] }
This is an instance of the record modeled as XML:
<oxh:item>
   <full_name>John Doe</full_name>
   <age>46</age>
</oxh:item>
Converting Avro records to XML enables XQuery to query them. The next example queries an Avro container file named person.avro, which contains Person records. The query converts the records to a CSV text file in which each line contains the full_name and age values:
import module "oxh:avro"; import module "oxh:text"; for $x in avro:collection-avroxml("person.avro") return text:put($x/full_name || "," || $x/age)
Null values are converted to nilled elements. A nilled element has an xsi:nil attribute set to true; it is always empty. You can use the XQuery fn:nilled function to test if a record field is null. For example, the following query writes the name of Person records that have a null value for age:
import module "oxh:avro"; import module "oxh:text"; for $x in avro:collection-avroxml("person.avro") where $x/age/nilled() return text:put($x/full_name)
For nested records, the fields of the inner schema become child elements of the element that corresponds to the field in the outer schema. For example, this schema has a nested record:
{ "type": "record", "name": "PersonAddress", "fields" : [ {"name": "full_name", "type": "string"}, {"name": "address", "type": { "type" : "record", "name" : "Address", "fields" : [ { "name" : "street", "type" : "string" }, { "name" : "city", "type" : "string" } ] } } ] }
This is an instance of the record as XML:
<oxh:item>
   <full_name>John Doe</full_name>
   <address>
      <street>123 First St.</street>
      <city>New York</city>
   </address>
</oxh:item>
The following example queries an Avro container file named person-address.avro that contains PersonAddress records, and writes the names of the people who live in New York to a text file:
import module "oxh:avro"; import module "oxh:text"; for $person in avro:collection-avroxml("examples/person-address.avro") where $person/address/city eq "New York" return text:put($person/full_name)
Avro map values are converted to an element that contains one child <oxh:entry> element for each entry in the map. For example, consider the following schema:
{ "type": "record", "name": "PersonProperties", "fields" : [ {"name": "full_name", "type": "string"}, {"name": "properties", "type": {"type": "map", "values": "string"} } ] }
This is an instance of the schema as XML:
<oxh:item>
   <full_name>John Doe</full_name>
   <properties>
      <oxh:entry key="employer">Example Inc</oxh:entry>
      <oxh:entry key="hair color">brown</oxh:entry>
      <oxh:entry key="favorite author">George RR Martin</oxh:entry>
   </properties>
</oxh:item>
The following example queries a file named person-properties.avro that contains PersonProperties records, and writes the names of the people employed by Example Inc. The query shows how regular XPath expressions can retrieve map entries. Alternatively, you can use the avro:get function as a shortcut to retrieve map entries.
import module "oxh:avro"; import module "oxh:text"; for $person in avro:collection-avroxml("person-properties.avro") where $person/properties/oxh:entry[@key eq "employer"] eq "Example Inc" return text:put($person/full_name)
The following query uses the avro:get function to retrieve the employer entry. It is equivalent to the previous query.
import module "oxh:avro"; import module "oxh:text"; for $person in avro:collection-avroxml("person-properties.avro") where $person/properties/avro:get("employer") eq "Example Inc" return text:put($person/full_name)
You can use the XQuery fn:nilled function to test for null values. This example returns true if the map entry is null:
$var/avro:get("key")/nilled()
Oracle XQuery for Hadoop converts Avro array values to an element that contains a child <oxh:item> element for each item in the array. For example, consider the following schema:
{ "type": "record", "name": "PersonScores", "fields" : [ {"name": "full_name", "type": "string"}, {"name": "scores", "type": {"type": "array", "items": "int"} } ] }
This is an instance of the schema as XML:
<oxh:item>
   <full_name>John Doe</full_name>
   <scores>
      <oxh:item>128</oxh:item>
      <oxh:item>151</oxh:item>
      <oxh:item>110</oxh:item>
   </scores>
</oxh:item>
The following example queries a file named person-scores.avro that contains PersonScores records, and writes the sum and count of scores for each person:
import module "oxh:avro"; import module "oxh:text"; for $person in avro:collection-avroxml("person-scores.avro") let $scores := $person/scores/* return text:put($person/full_name || "," || sum($scores) || "," || count($scores))
You can access a specific element of an array by using a numeric XPath predicate. For example, this path expression selects the second score. XPath indexing starts at 1 (not 0).
$person/scores/oxh:item[2]
Oracle XQuery for Hadoop converts an instance of an Avro union type based on the actual member type of the value. The name of the member type is added as an XML avro:type attribute to the enclosing element, which ensures that queries can distinguish between instances of different member types. However, the attribute is not added for trivial unions where there are only two member types and one of them is null.
For example, consider the following union of two records:
[ { "type": "record", "name": "Person1", "fields" : [ {"name": "full_name", "type": "string"} ] } , { "type": "record", "name": "Person2", "fields" : [ {"name": "fname", "type": "string"} ] } ]
This is an instance of the schema as XML:
<oxh:item avro:type="Person2">
   <fname>John Doe</fname>
</oxh:item>
The following example queries a file named person-union.avro that contains instances of the previous union schema, and writes the names of the people from both record types to a text file:
import module "oxh:avro"; import module "oxh:text"; for $person in avro:collection-avroxml("examples/person-union.avro") return if ($person/@avro:type eq "Person1") then text:put($person/full_name) else if ($person/@avro:type eq "Person2") then text:put($person/fname) else error(xs:QName("UNEXPECTED"), "Unexpected record type:" || $person/@avro:type)
The following table shows how Oracle XQuery for Hadoop maps Avro primitive types to XQuery atomic types.
Table 6-1 Mapping Avro Primitive Types to XQuery Atomic Types
Avro | XQuery |
---|---|
boolean | xs:boolean |
int | xs:int |
long | xs:long |
float | xs:float |
double | xs:double |
bytes | xs:hexBinary |
string | xs:string |
Avro null values are mapped to empty nilled elements. To distinguish between a null string value and an empty string value, use the XQuery nilled function. This path expression only returns true if the field value is null:
$record/field/fn:nilled()
Avro fixed values are mapped to xs:hexBinary, and enums are mapped to xs:string.
Both the Avro file adapter and the Oracle NoSQL Database adapter have an avroxml method, which you can use with the put functions to write XML as Avro. The following topics describe how the XML is converted to an Avro instance:
Oracle XQuery for Hadoop maps the XML to an Avro record schema by matching the child element names to the field names of the record. For example, consider the following Avro schema:
{ "type": "record", "name": "Person", "fields" : [ {"name": "full_name", "type": "string"}, {"name": "age", "type": ["int", "null"] } ] }
You can use the following XML element to write an instance of this record in which the full_name field is John Doe and the age field is 46. The name of the root element (person) is inconsequential. Only the names of the child elements are used to map to the Avro record fields (full_name and age).
<person>
   <full_name>John Doe</full_name>
   <age>46</age>
</person>
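The name-matching rule can be sketched as follows (a hypothetical helper for illustration, not the adapter's implementation): the root element's name is ignored, and each record field takes its value from the child element with the same name.

```python
# Match child element names to Avro record field names (illustration only).
import xml.etree.ElementTree as ET

def xml_to_record(xml_text, field_names):
    root = ET.fromstring(xml_text)  # root name ("person" here) is ignored
    children = {child.tag: child.text for child in root}
    return {name: children.get(name) for name in field_names}

rec = xml_to_record(
    "<person><full_name>John Doe</full_name><age>46</age></person>",
    ["full_name", "age"],
)
# rec == {"full_name": "John Doe", "age": "46"}
```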
The next example uses the following CSV file named people.csv:
John Doe,46
Jane Doe,37
 .
 .
 .
This query converts values from the CSV file to Avro Person records:
import module "oxh:avro"; import module "oxh:text"; declare %avro:put("avroxml") %avro:schema(' { "type": "record", "name": "Person", "fields" : [ {"name": "full_name", "type": "string"}, {"name": "age", "type": ["int", "null"] } ] } ') function local:put-person($person as element()) external; for $line in text:collection("people.csv") let $split := tokenize($line, ",") return local:put-person( <person> <full_name>{$split[1]}</full_name> <age>{$split[2]}</age> </person> )
For null values, you can omit the element or set the xsi:nil="true" attribute. For example, this modified query sets age to null when the value is not numeric:
. . .

for $line in text:collection("people.csv")
let $split := tokenize($line, ",")
return
   local:put-person(
      <person>
         <full_name>{$split[1]}</full_name>
         {
            if ($split[2] castable as xs:int) then
               <age>{$split[2]}</age>
            else
               ()
         }
      </person>
   )
In the case of nested records, the values are obtained from nested elements. The next example uses the following schema:
{ "type": "record", "name": "PersonAddress", "fields" : [ {"name": "full_name", "type": "string"}, {"name": "address", "type": { "type" : "record", "name" : "Address", "fields" : [ { "name" : "street", "type" : "string" }, { "name" : "city", "type" : "string" } ] } } ] }
You can use the following XML to write an instance of this record:
<person>
   <full_name>John Doe</full_name>
   <address>
      <street>123 First St.</street>
      <city>New York</city>
   </address>
</person>
Oracle XQuery for Hadoop converts XML to an Avro map with one map entry for each <oxh:entry> child element. For example, consider the following schema:
{ "type": "record", "name": "PersonProperties", "fields" : [ {"name": "full_name", "type": "string"}, {"name": "properties", "type": {"type": "map", "values": "string"} } ] }
You can use the following XML element to write an instance of this schema in which the full_name field is John Doe, and the properties field is set to a map with three entries:
<person>
   <full_name>John Doe</full_name>
   <properties>
      <oxh:entry key="hair color">brown</oxh:entry>
      <oxh:entry key="favorite author">George RR Martin</oxh:entry>
      <oxh:entry key="employer">Example Inc</oxh:entry>
   </properties>
</person>
Oracle XQuery for Hadoop converts XML to an Avro array with one item for each <oxh:item> child element. For example, consider the following schema:
{ "type": "record", "name": "PersonScores", "fields" : [ {"name": "full_name", "type": "string"}, {"name": "scores", "type": {"type": "array", "items": "int"} } ] }
You can use the following XML element to write an instance of this schema in which the full_name field is John Doe and the scores field is set to [128, 151, 110]:
<person>
   <full_name>John Doe</full_name>
   <scores>
      <oxh:item>128</oxh:item>
      <oxh:item>151</oxh:item>
      <oxh:item>110</oxh:item>
   </scores>
</person>
When writing an Avro union type, Oracle XQuery for Hadoop bases the selection of a member type on the value of the avro:type attribute.
This example uses the following schema:
[ { "type": "record", "name": "Person1", "fields" : [ {"name": "full_name", "type": "string"} ] } , { "type": "record", "name": "Person2", "fields" : [ {"name": "fname", "type": "string"} ] } ]
The following XML is mapped to an instance of the Person1 record:
<person avro:type="Person1"> <full_name>John Doe</full_name> </person>
This XML is mapped to an instance of the Person2 record:
<person avro:type="Person2"> <fname>John Doe</fname> </person>
The avro:type attribute selects the member type of the union. For trivial unions that contain a null and one other type, the avro:type attribute is unnecessary. If the member type cannot be determined, then an error is raised.
To map primitive values, Oracle XQuery for Hadoop uses the equivalent data types shown in Table 6-1 to cast an XML value to the corresponding Avro type. If the value cannot be converted to the Avro type, then an error is raised.
This example uses the following schema:
{ "type": "record", "name": "Person", "fields" : [ {"name": "full_name", "type": "string"}, {"name": "age", "type": ["int", "null"] } ] }
Attempting to map the following XML to an instance of this schema raises an error, because the string value apple cannot be converted to an int:
<person>
   <full_name>John Doe</full_name>
   <age>apple</age>
</person>
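The cast-then-fail behavior can be sketched as follows (the type map below is an assumption based on Table 6-1; this is an illustration, not OXH code):

```python
# Cast an XML text value to the Avro field's type; raise on failure.
CASTS = {"string": str, "int": int, "long": int, "float": float, "double": float}

def cast_field(value, avro_type):
    try:
        return CASTS[avro_type](value)
    except ValueError:
        raise ValueError(f"cannot convert {value!r} to Avro {avro_type}")

cast_field("46", "int")       # succeeds
# cast_field("apple", "int")  # raises an error, as in the example above
```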
The JSON file adapter provides access to JSON files stored in HDFS. It also contains functions for working with JSON data embedded in other file formats. For example, you can query JSON that is stored as lines in a large text file by using json:parse-as-xml with the text:collection function.
Processing a single JSON file in parallel is not currently supported. A set of JSON files can be processed in parallel, with sequential processing of each file.
The JSON module is described in the following topics:
To use the built-in functions in your query, you must import the JSON file adapter as follows:
import module "oxh:json";
The JSON module contains the following functions:
As of Big Data Connectors Release 4.9, Oracle XQuery for Hadoop also supports XQuery 3.1, including the standard facilities for processing JSON: fn:parse-json, fn:json-to-xml, and fn:xml-to-json.
Accesses a collection of JSON files in HDFS. Multiple files can be processed concurrently, but each individual file is parsed by a single task.
The JSON file adapter automatically decompresses files compressed with a Hadoop-supported compression codec.
Signature
declare %json:collection("jsonxml")
function json:collection-jsonxml($uris as xs:string*) as element()* external;
Parameters
$uris
: The JSON file URIs
Returns
XML elements that model the JSON values. See "About Converting JSON Data Formats to XML."
Parses a JSON value as XML.
Signature
json:parse-as-xml($arg as xs:string?) as element(*)?
Parameters
$arg
: Can be the empty sequence.
Returns
An XML element that models the JSON value, or an empty sequence if $arg is an empty sequence. See "About Converting JSON Data Formats to XML."
Retrieves an entry from a JSON object modeled as XML. See "About Converting JSON Data Formats to XML."
Signature
json:get($key as xs:string?, $obj as node()?) as element(oxh:entry)?
json:get($key as xs:string?) as element(oxh:entry)?
Parameters
$key
: The JSON data key
$obj
: The JSON object value
Returns
The value of the following XPath expression:
$obj/oxh:entry[@key eq $key]
If you omit the $obj parameter, then the behavior is identical to calling the two-argument function using the context item for $obj. See the Notes.
Notes
These function calls are equivalent:
$var/json:get("key") json:get("key", $var) $var/oxh:entry[@key eq "key"]
$var is a JSON object modeled as XML. See "Reading Maps."
You can use the following annotations to define functions that read collections of JSON files in HDFS. These annotations provide additional functionality that is not available using the built-in functions.
Signature
Custom functions for reading JSON files must have the following signature:
declare %json:collection("jsonxml") [additional annotations] function local:myFunctionName($uris as xs:string*) as element()* external;
Annotations
%json:collection("jsonxml")
Declares the collection function. The annotation parameter must be jsonxml.
Identifies the text encoding of the input files.
The valid encodings are those supported by the JVM. If this annotation is omitted, then the encoding is automatically detected from the JSON file as UTF-8, UTF-16 big-endian serialization (BE) or little-endian serialization (LE), or UTF-32 (BE or LE).
For better performance, omit the encoding annotation if the actual file encoding is one of those specified by RFC 4627 (JSON), Section 3 "Encoding," published by the Internet Engineering Task Force (IETF).
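The automatic detection follows RFC 4627, Section 3: because the first two characters of any JSON text are ASCII, the pattern of zero bytes in the first four octets identifies the Unicode encoding. A minimal sketch (illustration only, not OXH code):

```python
# Detect a JSON text's encoding from its first four octets (RFC 4627 §3).
def detect_json_encoding(first4: bytes) -> str:
    b = first4.ljust(4, b"\x01")  # pad short input with non-zero bytes
    zeros = [x == 0 for x in b]
    if zeros[0] and zeros[1] and zeros[2]:
        return "UTF-32BE"   # 00 00 00 xx
    if zeros[0] and zeros[2]:
        return "UTF-16BE"   # 00 xx 00 xx
    if zeros[1] and zeros[2] and zeros[3]:
        return "UTF-32LE"   # xx 00 00 00
    if zeros[1] and zeros[3]:
        return "UTF-16LE"   # xx 00 xx 00
    return "UTF-8"
```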
Returns
A collection of XML elements. Each element models the corresponding JSON value. See "About Converting JSON Data Formats to XML."
Example 6-3
This example uses the following JSON text files stored in HDFS:
mydata/users1.json
[
   { "user" : "john", "full name" : "John Doe", "age" : 45 },
   { "user" : "kelly", "full name" : "Kelly Johnson", "age" : 32 }
]

mydata/users2.json
[
   { "user" : "laura", "full name" : "Laura Smith", "age" : null },
   { "user" : "phil", "full name" : "Phil Johnson", "age" : 27 }
]
The following query selects the names of users whose last name is Johnson from users1.json and users2.json:
import module "oxh:text"; import module "oxh:json"; for $user in json:collection-jsonxml("mydata/users*.json")/oxh:item let $fullname := $user/json:get("full name") where tokenize($fullname, "\s+")[2] eq "Johnson" return text:put-text($fullname)
This query generates text files that contain the following lines:
Phil Johnson
Kelly Johnson
The remaining examples query the following text file in HDFS:
mydata/users-json.txt
{ "user" : "john", "full name" : "John Doe", "age" : 45 }
{ "user" : "kelly", "full name" : "Kelly Johnson", "age" : 32 }
{ "user" : "laura", "full name" : "Laura Smith", "age" : null }
{ "user" : "phil", "full name" : "Phil Johnson", "age" : 27 }
Example 6-4
The following query selects the names of users who are older than 30 from users-json.txt:
import module "oxh:text"; import module "oxh:json"; for $line in text:collection("mydata/users-json.txt") let $user := json:parse-as-xml($line) where $user/json:get("age") gt 30 return text:put($user/json:get("full name"))
This query generates text files that contain the following lines:
John Doe
Kelly Johnson
Example 6-5
The next query selects the names of users that have a null value for age from users-json.txt:
import module "oxh:text"; import module "oxh:json"; for $line in text:collection("mydata/users-json.txt") let $user := json:parse-as-xml($line) where $user/json:get("age")/nilled() return text:put($user/json:get("full name"))
This query generates a text file that contains the following line:
Laura Smith
Oracle XQuery for Hadoop uses the generic options for specifying configuration properties in the hadoop command. You can use the -conf option to identify configuration files, and the -D option to specify individual properties.
The following configuration properties are equivalent to the Jackson parser options with the same names. You can enter the option name in either upper or lower case. For example, oracle.hadoop.xquery.json.parser.ALLOW_BACKSLASH_ESCAPING_ANY_CHARACTER and oracle.hadoop.xquery.json.parser.allow_backslash_escaping_any_character are equivalent.
oracle.hadoop.xquery.json.parser.ALLOW_BACKSLASH_ESCAPING_ANY_CHARACTER
Type: Boolean
Default Value: false
Description: Enables any character to be escaped with a backslash (\). Otherwise, only the following characters can be escaped: quotation mark ("), slash (/), backslash (\), backspace (\b), form feed (\f), new line (\n), carriage return (\r), horizontal tab (\t), and hexadecimal representations (\unnnn).
oracle.hadoop.xquery.json.parser.ALLOW_COMMENTS
Type: Boolean
Default Value: false
Description: Allows Java and C++ comments (/* */ and //) within the parsed text.
oracle.hadoop.xquery.json.parser.ALLOW_NON_NUMERIC_NUMBERS
Type: Boolean
Default Value: false
Description: Allows Not-a-Number (NaN) tokens to be parsed as floating-point number values.
oracle.hadoop.xquery.json.parser.ALLOW_NUMERIC_LEADING_ZEROS
Type: Boolean
Default Value: false
Description: Allows integral numbers to start with zeros, such as 00001. The zeros do not change the value and can be ignored.
oracle.hadoop.xquery.json.parser.ALLOW_SINGLE_QUOTES
Type: Boolean
Default Value: false
Description: Allows single quotes (') to delimit string values.
oracle.hadoop.xquery.json.parser.ALLOW_UNQUOTED_CONTROL_CHARS
Type: Boolean
Default Value: false
Description: Allows JSON strings to contain unquoted control characters (that is, ASCII characters with a decimal value less than 32, including the tab and line feed).
oracle.hadoop.xquery.json.parser.ALLOW_UNQUOTED_FIELD_NAMES
Type: Boolean
Default Value: false
Description: Allows unquoted field names, which are allowed by JavaScript but not the JSON specification.
Related Topics
This section describes how JSON data formats are converted to XML. It contains the following topics:
As of Big Data Connectors Release 4.9, Oracle XQuery for Hadoop also supports XQuery 3.1, including the standard facilities for processing JSON: fn:parse-json, fn:json-to-xml, and fn:xml-to-json.
JSON objects are similar to Avro maps and are converted to the same XML structure. See "Reading Maps."
For example, the following JSON object is converted to an XML element:
{ "user" : "john", "full_name" : "John Doe", "age" : 45 }
The object is modeled as the following element:
<oxh:item>
   <oxh:entry key="user">john</oxh:entry>
   <oxh:entry key="full_name">John Doe</oxh:entry>
   <oxh:entry key="age">45</oxh:entry>
</oxh:item>
JSON arrays are similar to Avro arrays and are converted to the same XML structure. See "Reading Arrays."
For example, the following JSON array is converted to an XML element:
[ "red", "blue", "green" ]
The array is modeled as the following element:
<oxh:item>
   <oxh:item>red</oxh:item>
   <oxh:item>blue</oxh:item>
   <oxh:item>green</oxh:item>
</oxh:item>
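Both conversions can be sketched together (a hypothetical helper for illustration, not the adapter's implementation): objects become <oxh:entry key="..."> children and arrays become nested <oxh:item> children.

```python
# Render a parsed JSON value in the oxh XML model (illustration only).
def body(value):
    if isinstance(value, dict):
        return "".join(f'<oxh:entry key="{k}">{body(v)}</oxh:entry>'
                       for k, v in value.items())
    if isinstance(value, list):
        return "".join(f"<oxh:item>{body(v)}</oxh:item>" for v in value)
    return str(value)

def to_oxh_xml(value):
    return f"<oxh:item>{body(value)}</oxh:item>"
```

For example, to_oxh_xml(["red", "blue", "green"]) produces the nested <oxh:item> structure shown above.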
The Oracle Database adapter provides custom functions for loading data into tables in Oracle Database.
A custom put function supported by this adapter automatically calls Oracle Loader for Hadoop at run time, either to load the data immediately or to output it to HDFS. You can declare and use multiple custom Oracle Database adapter put functions within a single query. For example, you might load data into different tables or into different Oracle databases with a single query.
Ensure that Oracle Loader for Hadoop is installed on your system, and that the OLH_HOME environment variable is set to the installation directory. See Step 3 of "Installing Oracle XQuery for Hadoop." Although not required, you might find it helpful to familiarize yourself with Oracle Loader for Hadoop before using this adapter.
The Oracle Database adapter is described in the following topics:
See Also:
"Software Requirements" for the versions of Oracle Database that Oracle Loader for Hadoop supports
You can use the following annotations to define functions that write to tables in an Oracle database either directly or by generating binary or text files for subsequent loading with another utility, such as SQL*Loader.
Signature
Custom functions for writing to Oracle database tables must have the following signature:
declare
   %oracle:put(["jdbc" | "oci" | "text" | "datapump"])
   [%oracle:columns(col1 [, col2...])]
   [%oracle-property annotations]
function local:myPut($column1 [as xs:allowed_type_name[?]],
                     [$column2 [as xs:allowed_type_name[?]], ...]) external;
Annotations
Declares the put function and the output mode. Required.
The optional output_mode parameter can be one of the following string literal values:
jdbc: Writes to an Oracle database table using a JDBC connection. Default.
See "JDBC Output Format."
oci
: Writes to an Oracle database table using an Oracle Call Interface (OCI) connection.
datapump
: Creates Data Pump files and associated scripts in HDFS for subsequent loading by another utility.
text
: Creates delimited text files and associated scripts in HDFS.
See "Delimited Text Output Format."
For Oracle XQuery for Hadoop to write directly to an Oracle database table using either JDBC or OCI, all systems involved in processing the query must be able to connect to the Oracle Database system. See "About the Modes of Operation."
%oracle:columns(col1 [, col2...])
Identifies a selection of one or more column names in the target table. The order of column names corresponds to the order of the function parameters. See "Parameters." Optional.
This annotation enables loading a subset of the table columns. If omitted, the put function attempts to load all columns of the target table.
Controls various aspects of connecting to the database and writing data. You can specify multiple %oracle-property
annotations. Each %oracle-property
annotation has an equivalent Oracle Loader for Hadoop configuration property. "Oracle Loader for Hadoop Configuration Properties and Corresponding %oracle-property Annotations" explains this relationship in detail.
The %oracle-property
annotations are optional. However, the various loading scenarios require you to specify some of them or their equivalent configuration properties. For example, to load data into an Oracle database using JDBC or OCI, you must specify the target table and the connection information.
The following example specifies a target table named VISITS
, a user name of db
, a password of password
, and the URL connection string:
%oracle-property:targetTable('visits')
%oracle-property:connection.user('db')
%oracle-property:connection.password('password')
%oracle-property:connection.url('jdbc:oracle:thin:@//localhost:1521/orcl.example.com')
Parameters
Enter a parameter for each column in the same order as the Oracle table columns to load all columns, or use the %oracle:columns
annotation to load selected columns.
Because the correlation between parameters and database columns is positional, the name of the parameter (column1 in the parameter syntax) is not required to match the name of the database column.
You can omit the explicit as xs:allowed_type_name
type declaration for any parameter. For example, you can declare the parameter corresponding to a NUMBER
column simply as $column1
. In this case, the parameter is automatically assigned an XQuery type of item()*
. At run time, the input value is cast to the allowed XQuery type for the corresponding table column type, as described in the following table. For example, data values that are mapped to a column with a NUMBER
data type are automatically cast as xs:decimal
. An error is raised if the cast fails.
Alternatively, you can specify the type or its subtype for any parameter. In this case, compile-time type checking is performed. For example, you can declare a parameter corresponding to a NUMBER
column as $column as xs:decimal
. You can also declare it as any subtype of xs:decimal
, such as xs:integer
.
You can include the ?
optional occurrence indicator for each specified parameter type. This indicator allows the empty sequence to be passed as a parameter value at run time, so that a null is inserted into the database table. Any occurrence indicator other than ?
raises a compile-time error.
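As a sketch of this behavior (the parameter names are illustrative, and the connection properties from the earlier example would also be required), the following declaration loads only the TIME and NAME columns of the VISITS table and marks both parameters optional, so the empty sequence can be passed to insert a null:

```xquery
declare
   %oracle:put
   %oracle:columns('time', 'name')
   %oracle-property:targetTable('visits')
   (: connection properties omitted for brevity :)
function local:myPut(
   $c1 as xs:dateTime?,   (: TIME column :)
   $c2 as xs:string?      (: NAME column; ? permits the empty sequence :)
) external;

(: Passing () for $c2 inserts a null into the NAME column :)
local:myPut(xs:dateTime("2013-10-30T10:01:01"), ())
```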
The following table describes the appropriate mappings of XQuery data types with the supported Oracle Database data types. In addition to the listed XQuery data types, you can also use the subtypes, such as xs:integer
instead of xs:decimal
. Oracle data types are more restrictive than XQuery data types, and these restrictions are identified in the table.
Table 6-3 Data Type Mappings Between Oracle Database and XQuery
Database Type | XQuery Type |
---|---|
VARCHAR2 | xs:string. Limited by the VARCHAR2 maximum size. |
CHAR | xs:string. Limited by the CHAR maximum size. |
NVARCHAR2 | xs:string. Limited by the NVARCHAR2 maximum size. |
NCHAR | xs:string. Limited by the NCHAR maximum size. |
DATE | xs:dateTime. Limited to the range of January 1, 4712 BC, to December 31, 9999 CE. If a time zone is specified in the xs:dateTime value, then the time zone information is dropped. |
TIMESTAMP | xs:dateTime. Limited to the range of January 1, 4712 BC, to December 31, 9999 CE. If a time zone is specified in the xs:dateTime value, then the time zone information is dropped. |
TIMESTAMP WITH TIME ZONE | xs:dateTime. Limited to the range of January 1, 4712 BC, to December 31, 9999 CE. In the offset from UTC, the time-zone hour field is limited to -12:00 to 14:00. Fractional seconds are limited to a precision of 0 to 9 digits. See "About Session Time Zones." |
TIMESTAMP WITH LOCAL TIME ZONE | xs:dateTime. Limited to the range of January 1, 4712 BC, to December 31, 9999 CE. In the offset from UTC, the time-zone hour field is limited to -12:00 to 14:00. Fractional seconds are limited to a precision of 0 to 9 digits. See "About Session Time Zones." |
INTERVAL DAY TO SECOND | xs:dayTimeDuration. The day and fractional seconds are limited by a precision of 0 to 9 digits each. The hour is limited to a range of 0 to 23, and minutes and seconds are limited to a range of 0 to 59. |
INTERVAL YEAR TO MONTH | xs:yearMonthDuration. The year is limited by a precision of 0 to 9 digits, and the month is limited to a range of 0 to 11. |
BINARY_FLOAT | xs:float |
BINARY_DOUBLE | xs:double |
NUMBER | xs:decimal. Limited by the NUMBER precision. |
FLOAT | xs:decimal. Limited by the FLOAT precision. |
RAW | xs:hexBinary. Limited by the RAW maximum size. |
About Session Time Zones
If an xs:dateTime
value with no time zone is loaded into TIMESTAMP WITH TIME ZONE
or TIMESTAMP WITH LOCAL TIME ZONE
, then the time zone is set to the value of the sessionTimeZone
parameter, which defaults to the JVM time zone. Using Oracle XQuery for Hadoop, you can set the sessionTimeZone
property, as described in "Oracle Loader for Hadoop Configuration Properties and Corresponding %oracle-property Annotations."
Notes
With JDBC or OCI output modes, the Oracle Database Adapter loads data directly into the database table. It also creates a directory with the same name as the custom put
function name, under the query output directory. For example, if your query output directory is myoutput, and your custom function is myPut
, then the myoutput/myPut
directory is created.
For every custom Oracle Database Adapter put function, a separate directory is created. This directory contains output produced by the Oracle Loader for Hadoop job. When you use datapump
or text
output modes, the data files are written to this directory. The control and SQL scripts for loading the files are written to the _olh
subdirectory, such as myoutput/myPut/_olh
.
For descriptions of the generated files, see "Delimited Text Output Format" and "Oracle Data Pump Output Format."
These examples use the following text files in HDFS. The files contain a log of visits to different web pages. Each line represents a visit to a web page and contains the time, user name, page visited, and HTTP status code:
mydata/visits1.log
2013-10-28T06:00:00, john, index.html, 200
2013-10-28T08:30:02, kelly, index.html, 200
2013-10-28T08:32:50, kelly, about.html, 200
2013-10-30T10:00:10, mike, index.html, 401

mydata/visits2.log
2013-10-30T10:00:01, john, index.html, 200
2013-10-30T10:05:20, john, about.html, 200
2013-11-01T08:00:08, laura, index.html, 200
2013-11-04T06:12:51, kelly, index.html, 200
2013-11-04T06:12:40, kelly, contact.html, 200
The examples also use the following file in HDFS, which contains anonymous page visits:
mydata/anonvisits.log
2011-10-30T10:01:01, index.html, 401
2011-11-04T06:15:40, contact.html, 401
This SQL command creates the VISITS
table in the Oracle database:
CREATE TABLE visits (time TIMESTAMP, name VARCHAR2(15), page VARCHAR2(15), code NUMBER)
Example 6-6 Loading All Columns
The first query loads all information related to the page visit (time of visit, user name, page visited, and status code) to the VISITS
table. For anonymous access, the user name is missing, therefore the query specifies ()
to insert a null into the table. The target table name, user name, password, and connection URL are specified with %oracle-property
annotations.
The example uses a clear-text user name and password, which is insecure but acceptable in a development environment. Oracle recommends that you use a wallet instead for security, especially in a production application. You can configure an Oracle wallet using either Oracle Loader for Hadoop properties or their equivalent %oracle-property
annotations. The specific properties that you must set are described in "Providing the Connection Details for Online Database Mode."
import module "oxh:text"; declare %oracle:put %oracle-property:targetTable('visits') %oracle-property:connection.user('db') %oracle-property:connection.password('password') %oracle-property:connection.url('jdbc:oracle:thin:@//localhost:1521/orcl.example.com') function local:myPut($c1, $c2, $c3, $c4) external; for $line in text:collection("mydata/*visits*.log") let $split := fn:tokenize($line, "\s*,\s*") return if (count($split) > 3) then local:myPut($split[1], $split[2], $split[3], $split[4]) else local:myPut($split[1], (), $split[2], $split[3])
The VISITS
table contains the following data after the query runs:
TIME                           NAME            PAGE            CODE
------------------------------ --------------- --------------- ----------
30-OCT-13 10.00.01.000000 AM   john            index.html      200
30-OCT-13 10.05.20.000000 AM   john            about.html      200
01-NOV-13 08.00.08.000000 AM   laura           index.html      200
04-NOV-13 06.12.51.000000 AM   kelly           index.html      200
04-NOV-13 06.12.40.000000 AM   kelly           contact.html    200
28-OCT-13 06.00.00.000000 AM   john            index.html      200
28-OCT-13 08.30.02.000000 AM   kelly           index.html      200
28-OCT-13 08.32.50.000000 AM   kelly           about.html      200
30-OCT-13 10.00.10.000000 AM   mike            index.html      401
30-OCT-11 10.01.01.000000 AM                   index.html      401
04-NOV-11 06.15.40.000000 AM                   contact.html    401
Example 6-7 Loading Selected Columns
This example uses the %oracle:columns
annotation to load only the time
and name
columns of the table. It also loads only visits by john
.
The column names specified in %oracle:columns
are positionally correlated to the put function parameters. Data values provided for the $c1
parameter are loaded into the TIME
column, and data values provided for the $c2
parameter are loaded into the NAME
column.
import module "oxh:text"; declare %oracle:put %oracle:columns('time', 'name') %oracle-property:targetTable('visits') %oracle-property:connection.user('db') %oracle-property:connection.password('password') %oracle-property:connection.url('jdbc:oracle:thin:@//localhost:1521/orcl.example.com') function local:myPut($c1, $c2) external; for $line in text:collection("mydata/*visits*.log") let $split := fn:tokenize($line, "\s*,\s*") where $split[2] eq 'john' return local:myPut($split[1], $split[2])
If the VISITS
table is empty before the query runs, then it contains the following data afterward:
TIME                           NAME            PAGE            CODE
------------------------------ --------------- --------------- ----------
30-OCT-13 10.00.01.000000 AM   john
30-OCT-13 10.05.20.000000 AM   john
28-OCT-13 06.00.00.000000 AM   john
When you use the Oracle Database adapter of Oracle XQuery for Hadoop, you indirectly use Oracle Loader for Hadoop. Oracle Loader for Hadoop defines configuration properties that control various aspects of connecting to Oracle Database and writing data. Oracle XQuery for Hadoop supports many of these properties, which are listed in the last column of the table below.
You can specify these properties with the generic -conf
and -D
hadoop
command-line options in Oracle XQuery for Hadoop. Properties specified using this method apply to all Oracle Database adapter put functions in your query. See "Running Queries" and especially "Generic Options" for more information about the hadoop
command-line options.
Alternatively, you can specify these properties as Oracle Database adapter put function annotations with the %oracle-property
prefix. These annotations are listed in the second column of the table below. Annotations apply only to the particular Oracle Database adapter put function that contains them in its declaration.
For example, you can set the target table to VISITS
by adding the following lines to the configuration file, and identifying the configuration file with the -conf
option:
<property>
  <name>oracle.hadoop.loader.targetTable</name>
  <value>visits</value>
</property>
You can also set the target table to VISITS
with the -D
option, using the same Oracle Loader for Hadoop property:
-D oracle.hadoop.loader.targetTable=visits
Both methods set the target table to VISITS
for all Oracle Database adapter put functions in your query.
Alternatively, this annotation sets the target table to VISITS
only for the particular put function that has the annotation in the declaration:
%oracle-property:targetTable('visits')
This flexibility is provided for convenience. For example, if a query has multiple Oracle Database adapter put functions, each writing to a different table in the same database, then the most convenient approach is as follows:
Use the oracle.hadoop.loader.connection.url
property in the configuration file to specify the database connection URL. Then identify the configuration file using the -conf
option. This option sets the same database connection URL for all Oracle Database adapter put functions in your query.
Set a different table name using the %oracle-property:targetTable
annotation in each Oracle Database adapter put function declaration.
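The two steps above can be sketched as follows. In this hypothetical fragment (the table and function names are illustrative), both put functions inherit the connection URL from the file passed with -conf, and each annotation overrides only the target table:

```xquery
declare
   %oracle:put
   %oracle-property:targetTable('visits')
function local:putVisit($time, $name, $page, $code) external;

declare
   %oracle:put
   %oracle-property:targetTable('errors')
function local:putError($time, $page, $code) external;
```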
The following table identifies the Oracle Loader for Hadoop properties and their equivalent Oracle XQuery for Hadoop annotations by functional category. Oracle XQuery for Hadoop supports only the Oracle Loader for Hadoop properties listed in this table.
Table 6-4 Configuration Properties and Corresponding %oracle-property Annotations
Category | Property | Annotation |
---|---|---|
Connection | oracle.hadoop.loader.connection.defaultExecuteBatch | %oracle-property:connection.defaultExecuteBatch |
Connection | oracle.hadoop.loader.connection.oci_url | %oracle-property:connection.oci_url |
Connection | oracle.hadoop.loader.connection.password | %oracle-property:connection.password |
Connection | oracle.hadoop.loader.connection.sessionTimeZone | %oracle-property:connection.sessionTimeZone |
Connection | oracle.hadoop.loader.connection.tns_admin | %oracle-property:connection.tns_admin |
Connection | oracle.hadoop.loader.connection.tnsEntryName | %oracle-property:connection.tnsEntryName |
Connection | oracle.hadoop.loader.connection.url | %oracle-property:connection.url |
Connection | oracle.hadoop.loader.connection.user | %oracle-property:connection.user |
Connection | oracle.hadoop.loader.connection.wallet_location | %oracle-property:connection.wallet_location |
General | oracle.hadoop.loader.badRecordFlushInterval | %oracle-property:badRecordFlushInterval |
General | oracle.hadoop.loader.compressionFactors | %oracle-property:compressionFactors |
General | oracle.hadoop.loader.enableSorting | %oracle-property:enableSorting |
General | oracle.hadoop.loader.extTabDirectoryName | %oracle-property:extTabDirectoryName |
General | oracle.hadoop.loader.loadByPartition | %oracle-property:loadByPartition |
General | oracle.hadoop.loader.logBadRecords | %oracle-property:logBadRecords |
General | oracle.hadoop.loader.rejectLimit | %oracle-property:rejectLimit |
General | oracle.hadoop.loader.sortKey | %oracle-property:sortKey |
General | oracle.hadoop.loader.tableMetadataFile | %oracle-property:tableMetadataFile |
General | oracle.hadoop.loader.targetTable | %oracle-property:targetTable |
Output | oracle.hadoop.loader.output.dirpathBufsize | %oracle-property:output.dirpathBufsize |
Output | oracle.hadoop.loader.output.escapeEnabled | %oracle-property:output.escapeEnabled |
Output | oracle.hadoop.loader.output.fieldTerminator | %oracle-property:output.fieldTerminator |
Output | oracle.hadoop.loader.output.granuleSize | %oracle-property:output.granuleSize |
Output | oracle.hadoop.loader.output.initialFieldEncloser | %oracle-property:output.initialFieldEncloser |
Output | oracle.hadoop.loader.output.trailingFieldEncloser | %oracle-property:output.trailingFieldEncloser |
Sampler | oracle.hadoop.loader.sampler.enableSampling | %oracle-property:sampler.enableSampling |
Sampler | oracle.hadoop.loader.sampler.hintMaxSplitSize | %oracle-property:sampler.hintMaxSplitSize |
Sampler | oracle.hadoop.loader.sampler.hintNumMapTasks | %oracle-property:sampler.hintNumMapTasks |
Sampler | oracle.hadoop.loader.sampler.loadCI | %oracle-property:sampler.loadCI |
Sampler | oracle.hadoop.loader.sampler.maxHeapBytes | %oracle-property:sampler.maxHeapBytes |
Sampler | oracle.hadoop.loader.sampler.maxLoadFactor | %oracle-property:sampler.maxLoadFactor |
Sampler | oracle.hadoop.loader.sampler.maxSamplesPct | %oracle-property:sampler.maxSamplesPct |
Sampler | oracle.hadoop.loader.sampler.minSplits | %oracle-property:sampler.minSplits |
Sampler | oracle.hadoop.loader.sampler.numThreads | %oracle-property:sampler.numThreads |
This adapter provides functions to read and write values stored in Oracle NoSQL Database.
This adapter is described in the following topics:
Built-in Functions for Reading from and Writing to Oracle NoSQL Database
Built-in Functions for Reading from and Writing to Oracle NoSQL Database using Table API
Custom Functions for Reading Values from Oracle NoSQL Database
Custom Functions for Retrieving Single Values from Oracle NoSQL Database
Custom Functions for Reading Values from Oracle NoSQL Database using Table API
Custom Functions for Reading Single Row from Oracle NoSQL Database using Table API
Custom Functions for Retrieving Single Values from Oracle NoSQL Database using Large Object API
Custom Functions for Writing Values to Oracle NoSQL Database using Table API
Custom Functions for Writing Values to Oracle NoSQL Database using Large Object API
Before you write queries that use the Oracle NoSQL Database adapter, you must configure Oracle XQuery for Hadoop to use your Oracle NoSQL Database server.
You must set the following:
The KVHOME
environment variable to the local directory that contains the Oracle NoSQL Database lib directory.
The oracle.kv.hosts
and oracle.kv.kvstore
configuration properties.
The OXH_SOLR_MR_HOME
environment variable to the local directory containing search-mr-<version>.jar
and search-mr-<version>-job.jar
, but only when the Tika parser is invoked. That is, only when the kv:collection-tika()
or kv:get-tika()
functions are called, or when the %kv:collection('tika')
or %kv:get('tika')
annotations are used with external functions.
You can set the configuration properties using either the -D
or -conf
options in the hadoop
command when you run the query. See "Running Queries."
This example sets KVHOME
and uses the hadoop
-D
option in a query to set oracle.kv.kvstore
:
$ export KVHOME=/local/path/to/kvstore/
$ hadoop jar $OXH_HOME/lib/oxh.jar \
    -D oracle.kv.hosts=example.com:5000 \
    -D oracle.kv.kvstore=kvstore \
    ./myquery.xq -output ./myoutput
This example sets the OXH_SOLR_MR_HOME
environment variable, which is required when the Tika parser is invoked:
$ export OXH_SOLR_MR_HOME=/usr/lib/solr/contrib/mr
Note:
The HADOOP_CLASSPATH
environment variable or -libjars
command-line option must not contain the Oracle NoSQL Database JARs.
See "Oracle NoSQL Database Adapter Configuration Properties."
To use the built-in functions in your query, you must import the Oracle NoSQL Database module as follows:
import module "oxh:kv";
The Oracle NoSQL Database module contains the following functions:
Accesses a collection of values in the database. Each value is decoded as UTF-8 and returned as a string.
Signature
declare %kv:collection("text")
function kv:collection-text($parent-key as xs:string?, $depth as xs:int?, $subrange as xs:string?) as xs:string* external;

declare %kv:collection("text")
function kv:collection-text($parent-key as xs:string?, $depth as xs:int?) as xs:string* external;

declare %kv:collection("text")
function kv:collection-text($parent-key as xs:string?) as xs:string* external;
Parameters
See "Parameters." Omitting $subrange
is the same as specifying $subrange()
. Likewise, omitting $depth
is the same as specifying $depth()
.
Returns
One string for each value
Accesses a collection of values in the database. Each value is read as an Avro record and returned as an XML element. The records are converted to XML as described in "Reading Records."
Signature
declare %kv:collection("avroxml")
function kv:collection-avroxml($parent-key as xs:string?, $depth as xs:int?, $subrange as xs:string?) as element()* external;

declare %kv:collection("avroxml")
function kv:collection-avroxml($parent-key as xs:string?, $depth as xs:int?) as element()* external;

declare %kv:collection("avroxml")
function kv:collection-avroxml($parent-key as xs:string?) as element()* external;
Parameters
See "Parameters." Omitting $subrange
is the same as specifying $subrange()
. Likewise, omitting $depth
is the same as specifying $depth()
.
Returns
One XML element for each Avro record
Accesses a collection of values in the database. Each value is read as a sequence of bytes and parsed as XML.
Signature
declare %kv:collection("xml")
function kv:collection-xml($parent-key as xs:string?, $depth as xs:int?, $subrange as xs:string?) as document-node()* external;

declare %kv:collection("xml")
function kv:collection-xml($parent-key as xs:string?, $depth as xs:int?) as document-node()* external;

declare %kv:collection("xml")
function kv:collection-xml($parent-key as xs:string?) as document-node()* external;
Parameters
See "Parameters." Omitting $subrange
is the same as specifying $subrange()
. Likewise, omitting $depth
is the same as specifying $depth()
.
Returns
One XML document for each value.
Accesses a collection of values in the database. Each value is read as XDK binary XML and returned as an XML document.
Signature
declare %kv:collection("binxml") function kv:collection-binxml($parent-key as xs:string?, $depth as xs:int?, $subrange as xs:string?) as document-node()* external;
declare %kv:collection("binxml") function kv:collection-binxml($parent-key as xs:string?, $depth as xs:int?) as document-node()* external;
declare %kv:collection("binxml") function kv:collection-binxml($parent-key as xs:string?) as document-node()* external;
Parameters
See "Parameters." Omitting $subrange
is the same as specifying $subrange()
. Likewise, omitting $depth
is the same as specifying $depth()
.
Returns
One XML document for each value.
Accesses a collection of values in the database. Each value is parsed by Tika and returned as a document node.
Signature
declare %kv:collection("tika")
function kv:collection-tika($parent-key as xs:string?, $depth as xs:int?, $subrange as xs:string?, $contentType as xs:string?) as document-node()* external;
Parameters
See "Parameters." Omitting $subrange
is the same as specifying $subrange()
. Likewise, omitting $depth
is the same as specifying $depth()
.
Returns
One document node for each value.
Writes a key-value pair. The $value
is encoded as UTF-8.
Signature
declare %kv:put("text") function kv:put-text($key as xs:string, $value as xs:string) external;
Writes a key/value pair. The $xml
is serialized and encoded as UTF-8.
Signature
declare %kv:put("xml") function kv:put-xml($key as xs:string, $xml as node()) external;
Puts a key/value pair. The $xml
is encoded as XDK binary XML. See Oracle XML Developer's Kit Programmer's Guide.
Signature
declare %kv:put("binxml") function kv:put-binxml($key as xs:string, $xml as node()) external;
Obtains the value associated with the key. The value is decoded as UTF-8 and returned as a string.
Signature
declare %kv:get("text") function kv:get-text($key as xs:string) as xs:string? external;
Obtains the value associated with the key. The value is read as an Avro record and returned as an XML element. The records are converted to XML as described in "Reading Records."
Signature
declare %kv:get("avroxml") function kv:get-avroxml($key as xs:string) as element()? external;
Obtains the value associated with the key. The value is read as a sequence of bytes and parsed as XML.
Signature
declare %kv:get("xml") function kv:get-xml($key as xs:string) as document-node()? external;
Obtains the value associated with the key. The value is read as XDK binary XML and returned as an XML document.
Signature
declare %kv:get("binxml") function kv:get-binxml($key as xs:string) as document-node()? external;
Obtains the value associated with the key. The value is parsed as a byte array and returned as a document node.
Signature
declare %kv:get("tika") function kv:get-tika($key as xs:string, $contentType as xs:string?) as document-node()? external;
Defines a prefix range. The prefix defines both the lower and upper inclusive boundaries.
Use this function as the subrange argument of a kv:collection
function.
Signature
kv:key-range($prefix as xs:string) as xs:string;
Specifies a key range.
Use this function as the subrange argument of a kv:collection
function.
Signature
kv:key-range($start as xs:string, $start-inclusive as xs:boolean, $end as xs:string, $end-inclusive as xs:boolean) as xs:string;
Parameters
$start
: Defines the lower boundary of the key range.
$start-inclusive
: A value of true
includes $start
in the range, or false
omits it.
$end
: Defines the upper boundary of the key range. It must be greater than $start
.
$end-inclusive
: A value of true
includes $end
in the range, or false omits it.
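For example, the following sketch restricts a text collection to keys from "alpha" (inclusive) to "beta" (exclusive). Per the subrange string format <startType>/<start>/<end>/<endType> described later in this chapter, this call corresponds to the range string I/alpha/beta/E:

```xquery
import module "oxh:kv";

(: $depth is (), so the parent and all descendants are selected :)
for $value in kv:collection-text(
      "/parent/key",
      (),
      kv:key-range("alpha", fn:true(), "beta", fn:false()))
return $value
```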
To use the built-in functions in your query, you must declare the namespace and import the module as follows:
declare namespace kv-table = "oxh:kv-table";
import module "oxh:kv-table";
The Oracle NoSQL Database through Table API module contains the following functions:
These functions iterate over all or a subset of rows stored in a single table in the NoSQL Database. Each row is returned in the form of a JSON string.
Signature
declare %kv-table:collection("jsontext")
function kv-table:collection-jsontext($tableName as xs:string) as xs:string*

declare %kv-table:collection("jsontext")
function kv-table:collection-jsontext($tableName as xs:string, $primaryKeyJsonValue as xs:string?) as xs:string*

declare %kv-table:collection("jsontext")
function kv-table:collection-jsontext($tableName as xs:string, $primaryKeyJsonValue as xs:string?, $fieldRangeJsonValue as xs:string?) as xs:string*
Parameters
$tableName
as xs:string
– name of the table in NoSQL Database
$primaryKeyJsonValue
as xs:string?
– a partial primary key specified as JSON text
$fieldRangeJsonValue
as xs:string?
– field range for a remaining field of the given primary key specified as JSON text
{ "name": "fieldname", "start": "startVal", "startInclusive": true|false, "end": "endVal", "endInclusive": true|false }
Returns
JSON value of each row
Use the "json:parse-as-xml" function to parse a JSON string into an XML document.
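As a sketch (the Users table and its field names are hypothetical, and the "oxh:json" module import is assumed to provide json:parse-as-xml), the following query combines a partial primary key with a field range and parses each returned JSON row into an XML document:

```xquery
declare namespace kv-table = "oxh:kv-table";
import module "oxh:kv-table";
import module "oxh:json";

for $row in kv-table:collection-jsontext(
      "Users",
      '{ "lastName" : "Smith" }',
      '{ "name": "firstName", "start": "A", "startInclusive": true,
         "end": "M", "endInclusive": false }')
let $doc := json:parse-as-xml($row)
return $doc
```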
This function reads a single row stored in a table in NoSQL Database. The row is returned in the form of a JSON string. If the row is not found, then an empty sequence is returned.
Signature
declare %kv-table:get("jsontext")
function kv-table:get-jsontext($tableName as xs:string, $primaryKeyJsonValue as xs:string) as xs:string?
Parameters
$tableName
as xs:string
– name of the table in NoSQL Database
$primaryKeyJsonValue
as xs:string?
– a full primary key specified as JSON text
See Also:
http://docs.oracle.com/cd/NOSQL/html/GettingStartedGuideTables/primaryshardkeys.html#primarykeys
Returns
JSON value of the row or an empty sequence, if the row is not found.
Use the "json:parse-as-xml" function to parse a JSON string into an XML document.
This function writes a row into NoSQL Database using its Table API.
Signature
declare %kv-table:put("jsontext")
function kv-table:put-jsontext($tableName as xs:string, $jsonValue as xs:string);
Parameters
$tableName
as xs:string
– name of the table in NoSQL Database
$jsonValue
as xs:string
– row specified as JSON text
To use the built-in functions in your query, you must declare the namespace and import the module as follows:
declare namespace kv-lob = "oxh:kv-lob";
import module "oxh:kv-lob";
The Oracle NoSQL Database through Large Object API module contains the following functions:
Obtains the value associated with the key. The value is decoded as UTF-8 and returned as a string.
Signature
declare %kv-lob:get("text") function kv-lob:get-text($key as xs:string) as xs:string?
Obtains the value associated with the key. The value is read as a sequence of bytes and parsed as XML.
Signature
declare %kv-lob:get("xml") function kv-lob:get-xml($key as xs:string) as document-node()?
Obtains the value associated with the key. The value is read as XDK binary XML and returned as an XML document. See Oracle XML Developer's Kit Programmer's Guide.
Signature
declare %kv-lob:get("binxml") function kv-lob:get-binxml($key as xs:string) as document-node()?
Obtains the value associated with the key. The value is parsed as a byte array and returned as a document node.
Signature
declare %kv-lob:get("tika") function kv-lob:get-tika($key as xs:string) as document-node()?
declare %kv-lob:get("tika") function kv-lob:get-tika($key as xs:string, $contentType as xs:string?) as document-node()?
Writes a key-value pair. The $value
is encoded as UTF-8.
Signature
declare %kv-lob:put("text") function kv-lob:put-text($key as xs:string, $value as xs:string)
Writes a key/value pair. The $document
is serialized and encoded as UTF-8.
Signature
declare %kv-lob:put("xml") function kv-lob:put-xml($key as xs:string, $document as node())
You can use the following annotations to define custom functions that read values from Oracle NoSQL Database. These annotations provide additional functionality that is not available using the built-in functions.
Signature
Custom functions for reading collections of NoSQL values must have one of the following signatures:
declare %kv:collection("text") [additional annotations]
function local:myFunctionName($parent-key as xs:string?, $depth as xs:int?, $subrange as xs:string?) as xs:string* external;

declare %kv:collection(["xml"|"binxml"]) [additional annotations]
function local:myFunctionName($parent-key as xs:string?, $depth as xs:int?, $subrange as xs:string?) as document-node()* external;

declare %kv:collection("tika") [additional annotations]
function local:myFunctionName($parent-key as xs:string?, $depth as xs:int?, $subrange as xs:string?, $contentType as xs:string?) as document-node()* external;
Annotations
%kv:collection("method")
Declares the NoSQL Database collection function. Required.
The method parameter is one of the following values:
avroxml
: Each value is read as an Avro record and returned as an XML element. The records are converted to XML as described in "Reading Records."
binxml
: Each value is read as XDK binary XML and returned as an XML document.
text
: Each value is decoded using the character set specified by the %output:encoding
annotation.
tika
: Each value is parsed by Tika, and returned as a document node.
xml
: Each value is parsed as XML, and returned as an XML document.
%kv:key("true" | "false")
Controls whether the key of a key-value pair is set as the document-uri
of the returned value. Specify true
to return the key.
The default setting is true when method is xml
, avroxml
, or binxml
, and false
when it is text. Text functions with this annotation set to true
must be declared to return text()?
instead of xs:string?
. Atomic xs:string
values are not associated with a document node, but text nodes are. For example:
declare %kv:collection("text") %kv:key("true") function local:col($parent-key as xs:string?) as text()* external;
When the key is returned, you can obtain its string representation by using the kv:key()
function. For example:
for $value in local:col(...)
let $key := $value/kv:key()
return ...
%avro:schema-kv("schema-name")
Specifies the Avro reader schema. This annotation is valid only when method is avroxml
. Optional.
The schema-name is a fully qualified record name. The record schema is retrieved from the Oracle NoSQL Database catalog. The record value is mapped to the reader schema. For example, %avro:schema-kv("org.example.PersonRecord")
.
See Also:
For information about Avro schemas, the Oracle NoSQL Database Getting Started Guide at
http://docs.oracle.com/cd/NOSQL/html/GettingStartedGuide/schemaevolution.html
%output:encoding
Specifies the character encoding of text values. UTF-8 is assumed when this annotation is not used. The valid encodings are those supported by the JVM.
This annotation currently applies only to the text method. For XML files, the document's encoding declaration is used if it is available.
See Also:
"Supported Encodings" in the Oracle Java SE documentation at
http://docs.oracle.com/javase/7/docs/technotes/guides/intl/encoding.doc.html
Parameters
$parent-key as xs:string?
Specifies the parent key whose child KV pairs are returned by the function. The major key path must be a partial path and the minor key path must be empty. An empty sequence results in fetching all keys in the store.
See Also:
For the format of the key, Oracle NoSQL Database Java Reference at
http://docs.oracle.com/cd/NOSQL/html/javadoc/oracle/kv/Key.html#toString
$depth as xs:int?
Specifies whether parents, children, descendants, or a combination are returned. The following values are valid:
kv:depth-parent-and-descendants()
: Selects the parents and all descendants.
kv:depth-children-only()
: Selects only the immediate children, but not the parent.
kv:depth-descendants-only()
: Selects all descendants, but not the parent.
kv:depth-parent-and-children()
: Selects the parent and the immediate children.
An empty sequence implies kv:depth-parent-and-descendants()
.
This example selects all the descendants, but not the parent:
kv:collection-text("/parent/key", kv:depth-descendants-only(), ...
$subrange as xs:string?
Specifies a subrange to further restrict the range under $parent-key
to the major path components. The format of the string is:
<startType>/<start>/<end>/<endType>
The startType
and endType
are either I
for inclusive or E
for exclusive.
The start
and end
are the starting and ending key strings.
If the range does not have a lower boundary, then omit the leading startType/start
specification from the string representation. Similarly, if the range does not have an upper boundary, then omit the trailing end/endType
specification. A KeyRange
requires at least one boundary, thus at least one specification must appear in the string representation.
The kv:key-range
function provides a convenient way to create a range string.
The value can also be the empty sequence.
The following examples are valid subrange specifications:
Example | Description |
---|---|
I/alpha/beta/E | From alpha inclusive to beta exclusive |
E//0123/I | From "" exclusive to 0123 inclusive |
I/chi/ | From chi inclusive to infinity |
E// | From "" exclusive to infinity |
/chi/E | From negative infinity to chi exclusive |
//I | From negative infinity to "" inclusive |
The Oracle NoSQL Database adapter has get functions, which enable you to retrieve a single value from the database. Unlike collection functions, calls to get functions are not distributed across the cluster. When a get function is called, the value is retrieved by a single task.
Signature
Custom get
functions must have one of the following signatures:
declare %kv:get("text") [additional annotations] function local:myFunctionName($key as xs:string) as xs:string? external; declare %kv:get("avroxml") [additional annotations] function local:myFunctionName($key as xs:string) as element()? external; declare %kv:get(["xml"|"binxml"|"tika"]) [additional annotations] function local:myFunctionName($key as xs:string) as document-node()? external; declare %kv:get(["tika"]) [additional annotations] function local:myFunctionName($key as xs:string, $contentType as xs:string?) as document-node()? external;
Annotations
Declares the NoSQL Database get function. Required.
The method parameter is one of the following values:
avroxml
: The value is read as an Avro record and returned as an XML element. The records are converted to XML as described in "Reading Records."
binxml
: The value is read as XDK binary XML and returned as an XML document.
text
: The value is decoded using the character set specified by the %output:encoding
annotation.
tika
: Each value is parsed by Tika, and returned as a document node.
xml
: The value is parsed as XML and returned as an XML document.
Controls whether the key of a key-value pair is set as the document-uri
of the returned value. Specify true
to return the key.
The default setting is true
when method is xml
, avroxml
, or binxml
, and false
when it is text. Text functions with this annotation set to true
must be declared to return text()?
instead of xs:string?
. Atomic xs:string
values are not associated with a document node, but text nodes are.
When the key is returned, you can obtain its string representation by using the kv:key()
function.
Specifies the Avro reader schema. This annotation is valid only when method is avroxml
. Optional.
The schema-name is a fully qualified record name. The record schema is retrieved from the Oracle NoSQL Database catalog. The record value is mapped to the reader schema. For example, %avro:schema-kv("org.example.PersonRecord")
.
See Also:
For information about Avro schemas, the Oracle NoSQL Database Getting Started Guide at
http://docs.oracle.com/cd/NOSQL/html/GettingStartedGuide/schemaevolution.html
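For example, a custom get function that combines these annotations might look like the following sketch. The record name follows the example above; the key and field names are illustrative.

```xquery
import module "oxh:text";

(: A sketch: read one Avro value with an explicit reader schema.
   The key path and full_name field are hypothetical. :)
declare
   %kv:get("avroxml")
   %avro:schema-kv("org.example.PersonRecord")
function local:get-person($key as xs:string) as element()? external;

let $person := local:get-person("/users/avro/john")
return if (exists($person)) then text:put($person/full_name) else ()
```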
Specifies the character encoding of text values. UTF-8 is assumed when this annotation is not used. The valid encodings are those supported by the JVM.
This annotation currently only applies to the text method. For XML files, the document encoding declaration is used, if it is available.
See Also:
"Supported Encodings" in the Oracle Java SE documentation at
http://docs.oracle.com/javase/7/docs/technotes/guides/intl/encoding.doc.html
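As a brief sketch, a text get function that decodes values stored in a non-default character set might be declared as follows; the encoding and function name are illustrative.

```xquery
(: A sketch: decode stored text values as ISO-8859-1 instead of UTF-8. :)
declare
   %kv:get("text")
   %output:encoding("ISO-8859-1")
function local:get-latin1($key as xs:string) as xs:string? external;
```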
You can use the following functions to read values from Oracle NoSQL Database using Table API. These annotations provide additional functionality that is not available using the built-in functions.
Signature
Custom functions for reading collections of NoSQL values using Table API must have one of the following signatures:
declare %kv-table:collection("jsontext") function local:myFunctionName($tableName as xs:string) as xs:string* external;
declare %kv-table:collection("jsontext") function local:myFunctionName($tableName as xs:string, $primaryKeyJsonValue as xs:string?) as xs:string* external;
declare %kv-table:collection("jsontext") function local:myFunctionName($tableName as xs:string, $primaryKeyJsonValue as xs:string?, $fieldRangeJsonValue as xs:string?) as xs:string* external;
Annotations
Parameters
Same as "Parameters."
Returns
Same as "Returns."
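For instance, a custom Table API collection function that accepts a primary key might be declared as in the following sketch. The table name and the shape of the primary-key JSON are illustrative assumptions; the field range is omitted by passing an empty sequence.

```xquery
import module "oxh:text";

(: A sketch: read rows of a hypothetical "users" table as JSON text. :)
declare %kv-table:collection("jsontext")
function local:users($tableName as xs:string,
                     $primaryKeyJsonValue as xs:string?,
                     $fieldRangeJsonValue as xs:string?) as xs:string* external;

(: An empty JSON object selects all rows; no field range is applied. :)
for $row in local:users("users", '{}', ())
return text:put($row)
```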
You can use the following functions to read a single row from Oracle NoSQL Database using Table API. These annotations provide additional functionality that is not available using the built-in functions.
Signature
Custom functions that read a single row from Oracle NoSQL Database using Table API must have the following signature:
declare %kv-table:get("jsontext") function local:myFunctionName($tableName as xs:string, $primaryKeyJsonValue as xs:string?) as xs:string? external;
Annotations
Parameters
Same as "Parameters."
Returns
Same as "Returns."
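Assuming a built-in counterpart named kv-table:get-jsontext (a naming pattern inferred from the other modules in this chapter, not confirmed here), a single-row lookup might look like this sketch:

```xquery
import module "oxh:text";
import module "oxh:kv-table";

(: A sketch: fetch one row of a hypothetical "users" table by primary key. :)
let $row := kv-table:get-jsontext("users", '{"id":"kelly"}')
return if (exists($row)) then text:put($row) else ()
```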
You can use the following functions to read values from Oracle NoSQL Database using Large Object API. These annotations provide additional functionality that is not available using the built-in functions.
Signature
Custom functions for reading single values using Large Object API must have one of the following signatures:
declare %kv-lob:get("text") [additional annotations] function local:myFunctionName($key as xs:string) as xs:string? external;
declare %kv-lob:get(["xml"|"binxml"|"tika"]) [additional annotations] function local:myFunctionName($key as xs:string) as document-node()? external;
declare %kv-lob:get(["tika"]) [additional annotations] function local:myFunctionName($key as xs:string, $contentType as xs:string?) as document-node()? external;
Annotations
Declares the NoSQL Database get function that uses Large Object API. Required. Supported method parameters are binxml
, text
, tika
, and xml
– same as in %kv:get("method").
Note:
avroxml
method is not supported with Large Object API.
Controls whether the key of a key-value pair is set as the document-uri of the returned value. Specify true to return the key. Same as %kv:key().
Specifies the character encoding of text values. UTF-8 is assumed when this annotation is not used. The valid encodings are those supported by the JVM. This annotation currently only applies to the text method. For XML files, the document encoding declaration is used, if it is available.
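A minimal sketch of a custom Large Object get function using these annotations; the key follows the lob/name/suffix layout used in Example 6-13, but the file name and function name are hypothetical.

```xquery
import module "oxh:text";

(: A sketch: read an XML LOB and report its root element name. :)
declare %kv-lob:get("xml")
function local:get-lob-xml($key as xs:string) as document-node()? external;

let $doc := local:get-lob-xml("lob/report.xml/doc.lob")
return if (exists($doc)) then text:put(fn:name($doc/*)) else ()
```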
You can use the following annotations to define functions that write to Oracle NoSQL Database.
Signature
Custom functions for writing to Oracle NoSQL Database must have one of the following signatures:
declare %kv:put("text") function local:myFunctionName($key as xs:string, $value as xs:string) external; declare %kv:put(["xml"|"binxml"|"avroxml"]) function local:myFunctionName($key as xs:string, $xml as node()) external;
Annotations
Annotation | Description |
---|---|
%kv:put("method") |
Declares the NoSQL Database module put function. Required. The method determines how the value is stored. It must be one of the following values:
|
%avro:schema-kv("schema-name") |
Specifies the record schema of the values to be written. The annotation value is a fully qualified record name. The record schema is retrieved from the Oracle NoSQL Database catalog. For example: |
%output:* |
A standard XQuery serialization parameter for the output method (text or XML) specified in |
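Putting these annotations together, a custom function that stores values as binary XML might be declared as in this sketch; the key prefix is illustrative, and the input files are those used in the examples later in this section.

```xquery
import module "oxh:xmlf";

(: A sketch: store each comment element as binary XML under an illustrative key. :)
declare %kv:put("binxml")
function local:put-comment($key as xs:string, $xml as node()) external;

for $comment in xmlf:collection("mydata/comments*.xml", "comment")
return local:put-comment("/comments/" || $comment/@id, $comment)
```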
You can use the following annotations to define functions that write to Oracle NoSQL Database using Table API.
Signature
Custom functions for writing rows using Table API must have the following signature:
declare %kv-table:put("jsontext") function local:myFunctionName($tableName as xs:string, $jsonValue as xs:string?) external;
Annotations
Parameters
Same as "Parameters."
You can use the following annotations to define functions that write to Oracle NoSQL Database using Large Object API.
Signature
Custom functions for writing values using Large Object API must have one of the following signatures:
declare %kv-lob:put("text") function local:myFunctionName($key as xs:string, $value as xs:string) external;
declare %kv-lob:put(["xml"|"binxml"]) function local:myFunctionName($key as xs:string, $xml as node()) external;
Annotations
Declares the NoSQL Database put function. Required. Supported method parameters are binxml
, text
, and xml
– same as in "%kv:put("method")"
Note:
avroxml
method is not supported with Large Object API.
A standard XQuery serialization parameter for the output method (text or XML) specified in %kv-lob:put
. See "Serialization Annotations."
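As a sketch, a custom Large Object put function might be declared and used as follows. The key layout (lob/.../profile.lob, matching the default .lob suffix) and the input file path are illustrative.

```xquery
import module "oxh:text";

(: A sketch: store each user line as a text LOB keyed by user ID. :)
declare %kv-lob:put("text")
function local:put-lob-text($key as xs:string, $value as xs:string) external;

for $line in text:collection("mydata/users.txt")
let $id := fn:tokenize($line, ":")[1]
return local:put-lob-text("lob/users/" || $id || "/profile.lob", $line)
```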
Example 6-8 Writing and Reading Text in Oracle NoSQL Database
This example uses the following text file in HDFS. The file contains user profile information such as user ID, full name, and age, separated by colons (:).
mydata/users.txt
john:John Doe:45
kelly:Kelly Johnson:32
laura:Laura Smith:
phil:Phil Johnson:27
The first query stores the lines of this text file in Oracle NoSQL Database as text values.
import module "oxh:text"; import module "oxh:kv"; for $line in text:collection("mydata/users.txt") let $split := fn:tokenize($line, ":") let $key := "/users/text/" || $split[1] return kv:put-text($key, $line)
The next query reads the values from the database:
import module "oxh:text"; import module "oxh:kv"; for $value in kv:collection-text("/users/text") let $split := fn:tokenize($value, ":") where $split[2] eq "Phil Johnson" return text:put($value)
The query creates a text file that contains the following line:
phil:Phil Johnson:27
Example 6-9 Writing and Reading Avro in Oracle NoSQL Database
In this example, the following Avro schema is registered with Oracle NoSQL Database:
{ "type": "record", "name": "User", "namespace": "com.example", "fields" : [ {"name": "id", "type": "string"}, {"name": "full_name", "type": "string"}, {"name": "age", "type": ["int", "null"] } ] }
The next query writes the user names to the database as Avro records.
import module "oxh:text"; declare %kv:put("avroxml") %avro:schema-kv("com.example.User") function local:put-user($key as xs:string, $value as node()) external; for $line in text:collection("mydata/users.txt") let $split := fn:tokenize($line, ":") let $id := $split[1] let $key := "/users/avro/" || $id return local:put-user( $key, <user> <id>{$id}</id> <full_name>{$split[2]}</full_name> { if ($split[3] castable as xs:int) then <age>{$split[3]}</age> else () } </user> )
This query reads the values from the database:
import module "oxh:text"; import module "oxh:kv"; for $user in kv:collection-avroxml("/users/avro") where $user/age gt 30 return text:put($user/full_name)
The query creates text files that contain the following lines:
John Doe
Kelly Johnson
Example 6-10 Storing XML in NoSQL Database
The following query uses the XML files shown in Example 6-24 of "Examples of XML File Adapter Functions" as input. It writes each comment element as an Oracle NoSQL Database value:
import module "oxh:xmlf"; import module "oxh:kv"; for $comment in xmlf:collection("mydata/comments*.xml")/comments/comment let $key := "/comments/" || $comment/@id return kv:put-xml($key, $comment)
The query writes the five comment
elements as XML values in Oracle NoSQL Database.
For very large XML files, modify the query as follows to improve performance and reduce disk space consumption:
Use the following for
clause, which causes each XML file to be split and processed in parallel by multiple tasks:
for $comment in xmlf:collection("mydata/comments*.xml", "comment")
In the return clause, use kv:put-binxml
instead of kv:put-xml
to store the values as binary XML instead of plain text.
Use the kv:collection-xml
function to read the values in the database. For example:
import module "oxh:text"; import module "oxh:kv"; for $comment in kv:collection-xml("/comments")/comment return text:put($comment/@id || " " || $comment/@user)
The query creates text files that contain the following lines:
12345 john
23456 john
54321 mike
56789 kelly
87654 mike
Example 6-11 Storing XML as Avro in Oracle NoSQL Database
This example converts the XML values to Avro before they are stored.
Add the following Avro schema to Oracle NoSQL Database:
{ "type": "record", "name": "Comment", "namespace": "com.example", "fields" : [ {"name": "cid", "type": "string"}, {"name": "user", "type": "string"}, {"name": "content", "type": "string"}, {"name": "likes", "type" : { "type" : "array", "items" : "string" } } ] }
The following query writes five comment
elements as Avro values in Oracle NoSQL Database:
import module "oxh:xmlf"; import module "oxh:kv"; declare %kv:put("avroxml") %avro:schema-kv("com.example.Comment") function local:put-comment($key as xs:string, $value as node()) external; for $comment in xmlf:collection("mydata/comments*.xml", "comment") let $key := "/comments/" || $comment/@id let $value := <comment> <cid>{$comment/@id/data()}</cid> <user>{$comment/@user/data()}</user> <content>{$comment/@text/data()}</content> <likes>{ for $like in $comment/like return <oxh:item>{$like/@user/data()}</oxh:item> }</likes> </comment> return local:put-comment($key, $value)
Use the kv:collection-avroxml
function to read the values in the database. For example:
import module "oxh:text"; import module "oxh:kv"; for $comment in kv:collection-avroxml("/comments") return text:put($comment/cid || " " || $comment/user || " " || count($comment/likes/*))
The query creates text files that contain the following lines:
12345 john 0
23456 john 2
54321 mike 1
56789 kelly 2
87654 mike 0
Example 6-12 Reading and writing data using Oracle NoSQL Database Table API
This example uses the following text file in HDFS. The file contains user profile information such as user ID, full name, and age, separated by colons (:).
mydata/users.txt
john:John Doe:45
kelly:Kelly Johnson:32
laura:Laura Smith:
phil:Phil Johnson:27
Create a table named users in Oracle NoSQL Database as follows:
CREATE TABLE users (id STRING, name STRING, age INTEGER, PRIMARY KEY (id));
The first query stores the users with a known age in this table.
import module "oxh:text"; import module "oxh:kv-table"; for $line in text:collection("mydata/users.txt") let $split := tokenize($line, ":") let $id := $split[1] let $name := $split[2] let $age := $split[3] where string-length($age) gt 0 let $row := '{' || '"id":"' || $id || '",' || '"name":"' || $name || '",' || '"age":' || $age || '}' return kv-table:put-jsontext("users", $row)
After running this query the table contains the following records:
id | name | age |
---|---|---|
john | John Doe | 45 |
kelly | Kelly Johnson | 32 |
phil | Phil Johnson | 27 |
The second query reads rows from the table and returns the IDs of users whose names end with Johnson.
import module "oxh:text"; import module "oxh:json"; import module "oxh:kv-table"; for $row in kv-table:collection("users") let $user := json:parse-as-xml($row) let $id := $user/json:get("id") let $name := $user/json:get("name") where ends-with($name, "Johnson") return text:put($id)
The query creates a text file that contains the following lines:
kelly
phil
Example 6-13 Reading data using Oracle NoSQL Database Large Object API
Assuming Oracle NoSQL Database contains the following information:
Table userImages
CREATE TABLE userImages (imageFileName STRING, imageVersion STRING, imageDescription STRING, PRIMARY KEY (imageFileName))
imageFileName | imageVersion | imageDescription |
---|---|---|
IMG_001.JPG | 1 | Sunrise |
IMG_002.JPG | 1 | Sunrise |
Key/Value data loaded with Large Object API where:
Key is lob/imageFileName/image.lob
Value is JPEG image data that contains geolocation metadata in EXIF format
The following query extracts that metadata and converts it to CSV format as imageFileName, latitude, and longitude.
import module "oxh:kv-table"; import module "oxh:kv-lob"; import module "oxh:tika"; import module "oxh:json"; import module "oxh:text"; for $row in kv-table:collection("userImages") let $imageFileName := json:parse-as-xml($row)/json:get("imageFileName") let $imageKey := "lob/" || $imageFileName || "/image.lob" let $doc := kv-lob:get-tika($imageKey, "image/jpeg") let $lat := $doc/tika:metadata/tika:property[@name eq "GPS Latitude"] let $lon := $doc/tika:metadata/tika:property[@name eq "GPS Longitude"] where exists($lat) and exists($lon) return text:put($imageFileName || "," || $lat || "," || $lon)
Oracle XQuery for Hadoop uses the generic options for specifying configuration properties in the Hadoop command. You can use the -conf
option to identify configuration files, and the -D
option to specify individual properties. See "Running Queries."
You can set various configuration properties for the Oracle NoSQL Database adapter that control the durability characteristics and timeout periods. You must set oracle.kv.hosts
and oracle.kv.kvstore
. The following properties configure the Oracle NoSQL Database adapter.
Property | Description |
---|---|
oracle.hadoop.xquery.kv.config.durability |
Type: String Default Value: Description: Defines the durability characteristics of write operations. The value consists of three parts, which you specify in order and separate with commas: MasterPolicy, ReplicaPolicy, ReplicaAck
|
oracle.hadoop.xquery.kv.config.requestLimit |
Type: Comma-separated list of integers Default Value: 100, 90, 80 Description: Limits the number of simultaneous requests to prevent nodes with long service times from consuming all threads in the KV store client. The value consists of three integers, which you specify in order and separate with commas: maxActiveRequests, requestThresholdPercent, nodeLimitPercent
|
oracle.hadoop.xquery.kv.config.requestTimeout |
Type: Long Default Value: 5000 ms Description: Configures the request timeout period in milliseconds. The value must be greater than zero (0). |
oracle.hadoop.xquery.kv.config.socketOpenTimeout |
Type: Long Default Value: 5000 ms Description: Configures the open timeout used when establishing sockets for client requests, in milliseconds. Shorter timeouts result in more rapid failure detection and recovery. The default open timeout is adequate for most applications. The value must be greater than zero (0). |
oracle.hadoop.xquery.kv.config.socketReadTimeout |
Type: Long Default Value: 30000 ms Description: Configures the read timeout period associated with the sockets that make client requests, in milliseconds. Shorter timeouts result in more rapid failure detection and recovery. Nonetheless, the timeout period should be sufficient to allow the longest timeout associated with a request. |
oracle.kv.batchSize |
Type: Integer Default Value: Not defined Description: The desired number of keys for the InputFormat to fetch during each network round trip. A value of zero (0) sets the property to a default value. |
oracle.kv.consistency |
Type: Consistency Default Value: Description: The consistency guarantee for reading child key-value pairs. The following keywords are valid values:
|
oracle.kv.hosts |
Type: String Default Value: Not defined Description: An array of one or more hostname:port pairs that identify the hosts in the KV store with the source data. Separate multiple pairs with commas. |
oracle.kv.kvstore |
Type: String Default Value: Not defined Description: The name of the KV store with the source data. |
oracle.kv.timeout |
Type: Long Default Value: Not defined Description: Sets a maximum time interval in milliseconds for retrieving a selection of key-value pairs. A value of zero (0) sets the property to its default value. |
oracle.hadoop.xquery.kv.config.LOBSuffix |
Type: String Default Value: .lob Description: Configures the default suffix associated with LOB keys. |
oracle.hadoop.xquery.kv.config.LOBTimeout |
Type: Long Description: Configures the default timeout period for operations on LOBs, in milliseconds. |
oracle.hadoop.xquery.kv.config.readZones |
Type: Comma separated list of strings Default Value: Not defined Description: Sets the zones in which nodes must be located to be used for read operations. |
oracle.hadoop.xquery.kv.config.security |
Type: String Default Value: Not defined Description: Configures security properties for the client. |
The sequence file adapter provides functions to read and write Hadoop sequence files. A sequence file is a Hadoop-specific file format composed of key-value pairs.
The functions are described in the following topics:
To use the built-in functions in your query, you must import the sequence file module as follows:
import module "oxh:seq";
The sequence file module contains the following functions:
For examples, see "Examples of Sequence File Adapter Functions."
Accesses a collection of sequence files in HDFS and returns the values as strings. The files may be split up and processed in parallel by multiple tasks.
Signature
declare %seq:collection("text") function seq:collection($uris as xs:string*) as xs:string* external;
Parameters
$uris
: The sequence file URIs. The values in the sequence files must be either org.apache.hadoop.io.Text
or org.apache.hadoop.io.BytesWritable
. For BytesWritable values, the bytes are converted to a string using a UTF-8 decoder.
Returns
One string for each value in each file.
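For example, the following minimal query copies every value from a set of sequence files to text files; the input path assumes files produced by an earlier query in an output directory named myoutput.

```xquery
import module "oxh:seq";
import module "oxh:text";

(: Copy every sequence-file value to the text output of this query. :)
for $value in seq:collection("myoutput/part*")
return text:put($value)
```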
Accesses a collection of sequence files in HDFS, parses each value as XML, and returns it. Each file may be split up and processed in parallel by multiple tasks.
Signature
declare %seq:collection("xml") function seq:collection-xml($uris as xs:string*) as document-node()* external;
Parameters
$uris
: The sequence file URIs. The values in the sequence files must be either org.apache.hadoop.io.Text
or org.apache.hadoop.io.BytesWritable
. For BytesWritable
values, the XML document encoding declaration is used, if it is available.
Returns
One XML document for each value in each file. See "Tika Parser Output Format."
Accesses a collection of sequence files in HDFS, reads each value as binary XML, and returns it. Each file may be split up and processed in parallel by multiple tasks.
Signature
declare %seq:collection("binxml") function seq:collection-binxml($uris as xs:string*) as document-node()* external;
Parameters
$uris
: The sequence file URIs. The values in the sequence files must be org.apache.hadoop.io.BytesWritable
. The bytes are decoded as binary XML.
Returns
One XML document for each value in each file.
Notes
You can use this function to read files that were created by seq:put-binxml
in a previous query. See "seq:put-binxml."
Uses Tika to parse the sequence files in HDFS. The values in the sequence files must be either org.apache.hadoop.io.Text
or org.apache.hadoop.io.BytesWritable
. For each value, a document node produced by Tika is returned.
Signature
declare %seq:collection("tika") function seq:collection-tika($uris as xs:string*) as document-node()* external; declare %seq:collection("tika") function seq:collection-tika($uris as xs:string*, $contentType as xs:string?) as document-node()* external;
Parameters
$uris
: The sequence file URIs. The values in the sequence files must be either org.apache.hadoop.io.Text
or org.apache.hadoop.io.BytesWritable
. The Tika library automatically detects the character encoding. Alternatively, you can pass the encoding in the $contentType parameter as a charset attribute.
$contentType
: Specifies the media type of the content to parse, and may have the charset attribute.
Returns
One document node for each value in each file.
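As a sketch, values holding HTML could be parsed with an explicit media type and charset. The input path and the element names navigated in the Tika output are assumptions; see "Tika Parser Output Format" for the actual structure.

```xquery
import module "oxh:seq";
import module "oxh:text";

(: A sketch: parse HTML values with Tika and emit each page title.
   The path and output structure (*:html/*:head/*:title) are assumptions. :)
for $doc in seq:collection-tika("mydata/webpages*", "text/html;charset=UTF-8")
return text:put($doc/*:html/*:head/*:title)
```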
Writes either the string value or both the key and string value of a key-value pair to a sequence file in the output directory of the query.
This function writes the keys and values as org.apache.hadoop.io.Text
.
When the function is called without the $key
parameter, it writes the values as org.apache.hadoop.io.Text
and sets the key class to org.apache.hadoop.io.NullWritable
, because there are no key values.
Signature
declare %seq:put("text") function seq:put($key as xs:string, $value as xs:string) external; declare %seq:put("text") function seq:put($value as xs:string) external;
Parameters
$key
: The key of a key-value pair
$value
: The value of a key-value pair
Returns
empty-sequence()
Notes
The values are spread across one or more sequence files. The number of files created depends on how the query is distributed among tasks. Each file has a name that starts with part
, such as part-m-00000
. You specify the output directory when the query executes. See "Running Queries."
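For instance, this sketch writes each line of the users file shown earlier in this chapter as a value keyed by the user ID:

```xquery
import module "oxh:seq";
import module "oxh:text";

(: Write each line as a Text value, keyed by the first colon-separated field. :)
for $line in text:collection("mydata/users.txt")
return seq:put(fn:tokenize($line, ":")[1], $line)
```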
Writes either an XML value or a key and XML value to a sequence file in the output directory of the query.
This function writes the keys and values as org.apache.hadoop.io.Text
.
When the function is called without the $key
parameter, it writes the values as org.apache.hadoop.io.Text
and sets the key class to org.apache.hadoop.io.NullWritable
, because there are no key values.
Signature
declare %seq:put("xml") function seq:put-xml($key as xs:string, $xml as node()) external; declare %seq:put("xml") function seq:put-xml($xml as node()) external;
Parameters
$key
: The key of a key-value pair
$xml
: The XML value of a key-value pair
Returns
empty-sequence()
Notes
The values are spread across one or more sequence files. The number of files created depends on how the query is distributed among tasks. Each file has a name that starts with "part," such as part-m-00000. You specify the output directory when the query executes. See "Running Queries."
Encodes an XML value as binary XML and writes the resulting bytes to a sequence file in the output directory of the query. The values are spread across one or more sequence files.
This function writes the keys as org.apache.hadoop.io.Text
and the values as org.apache.hadoop.io.BytesWritable
.
When the function is called without the $key
parameter, it writes the values as org.apache.hadoop.io.BytesWritable
and sets the key class to org.apache.hadoop.io.NullWritable
, because there are no key values.
Signature
declare %seq:put("binxml") function seq:put-binxml($key as xs:string, $xml as node()) external; declare %seq:put("binxml") function seq:put-binxml($xml as node()) external;
Parameters
$key
: The key of a key-value pair
$xml
: The XML value of a key-value pair
Returns
empty-sequence()
Notes
The number of files created depends on how the query is distributed among tasks. Each file has a name that starts with part
, such as part-m-00000
. You specify the output directory when the query executes. See "Running Queries."
You can use the seq:collection-binxml
function to read the files created by this function. See "seq:collection-binxml."
You can use the following annotations to define functions that read collections of sequence files. These annotations provide additional functionality that is not available using the built-in functions.
Signature
Custom functions for reading sequence files must have one of the following signatures:
declare %seq:collection("text") [additional annotations] function local:myFunctionName($uris as xs:string*) as xs:string* external; declare %seq:collection(["xml"|"binxml"|"tika"]) [additional annotations] function local:myFunctionName($uris as xs:string*) as document-node()* external; declare %seq:collection(["tika"]) [additional annotations] function local:myFunctionName($uris as xs:string*, $contentType as xs:string?) as document-node()* external;
Annotations
Declares the sequence file collection function, which reads sequence files. Required.
The optional method parameter can be one of the following values:
text
: The values in the sequence files must be either org.apache.hadoop.io.Text
or org.apache.hadoop.io.BytesWritable
. Bytes are decoded using the character set specified by the %output:encoding
annotation. They are returned as xs:string
. Default.
xml
: The values in the sequence files must be either org.apache.hadoop.io.Text
or org.apache.hadoop.io.BytesWritable
. The values are parsed as XML and returned by the function.
binxml
: The values in the sequence files must be org.apache.hadoop.io.BytesWritable
. The values are read as XDK binary XML and returned by the function.
tika
: The values in the sequence files must be either org.apache.hadoop.io.Text
or org.apache.hadoop.io.BytesWritable
. The values are parsed by Tika and returned by the function.
Specifies the character encoding of the input values. The valid encodings are those supported by the JVM. UTF-8 is the default encoding.
See Also:
"Supported Encodings" in the Oracle Java SE documentation at
http://docs.oracle.com/javase/7/docs/technotes/guides/intl/encoding.doc.html
Controls whether the key of a key-value pair is set as the document-uri
of the returned value. Specify true
to return the keys. The default setting is true
when method is binxml
or xml
, and false
when it is text
.
Text functions with this annotation set to true
must return text()*
instead of xs:string*
because atomic xs:string
is not associated with a document.
When the keys are returned, you can obtain their string representations by using the seq:key
function.
This example returns text instead of string values because %seq:key
is set to true
.
declare %seq:collection("text") %seq:key("true") function local:col($uris as xs:string*) as text()* external;
The next example uses the seq:key
function to obtain the string representations of the keys:
for $value in local:col(...) let $key := $value/seq:key() return . . .
Specifies the maximum split size as either an integer or a string value. The split size controls how the input file is divided into tasks. Hadoop calculates the split size as max($split-min, min($split-max, $block-size))
. Optional.
In a string value, you can append K
, k
, M
, m
, G
, or g
to the value to indicate kilobytes, megabytes, or gigabytes instead of bytes (the default unit). These qualifiers are not case sensitive. The following examples are equivalent:
%seq:split-max(1024) %seq:split-max("1024") %seq:split-max("1K")
Specifies the minimum split size as either an integer or a string value. The split size controls how the input file is divided into tasks. Hadoop calculates the split size as max($split-min, min($split-max, $block-size))
. Optional.
In a string value, you can append K
, k
, M
, m
, G
, or g
to the value to indicate kilobytes, megabytes, or gigabytes instead of bytes (the default unit). These qualifiers are not case sensitive. The following examples are equivalent:
%seq:split-min(1024) %seq:split-min("1024") %seq:split-min("1K")
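Combining these annotations, a custom collection function that caps splits at 32 MB and returns keys might be declared as in this sketch; the split size, function name, and input path are illustrative.

```xquery
import module "oxh:text";

(: A sketch: text values returned as text() nodes so seq:key() is available,
   with splits capped at an illustrative 32 MB. :)
declare
   %seq:collection("text")
   %seq:split-max("32M")
   %seq:key("true")
function local:read-seq($uris as xs:string*) as text()* external;

for $value in local:read-seq("myoutput/part*")
return text:put($value/seq:key() || " " || $value)
```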
You can use the following annotations to define functions that write collections of sequence files in HDFS.
Signature
Custom functions for writing sequence files must have one of the following signatures. You can omit the $key
argument when you are not writing a key value.
declare %seq:put("text") [additional annotations] function local:myFunctionName($key as xs:string, $value as xs:string) external; declare %seq:put(["xml"|"binxml"]) [additional annotations] function local:myFunctionName($key as xs:string, $xml as node()) external;
Annotations
Declares the sequence file put function, which writes key-value pairs to a sequence file. Required.
If you use the $key
argument in the signature, then the key is written as org.apache.hadoop.io.Text
. If you omit the $key
argument, then the key class is set to org.apache.hadoop.io.NullWritable
.
Set the method parameter to text
, xml
, or binxml
. The method determines the type used to write the value:
text
: String written as org.apache.hadoop.io.Text
xml
: XML written as org.apache.hadoop.io.Text
binxml
: XML encoded as XDK binary XML and written as org.apache.hadoop.io.BytesWritable
Specifies the compression format used on the output. The default is no compression. Optional.
The codec parameter identifies a compression codec. The first registered compression codec that matches the value is used. The value matches a codec if it equals one of the following:
The fully qualified class name of the codec
The unqualified class name of the codec
The prefix of the unqualified class name before Codec
(case insensitive)
Set the compressionType parameter to one of these values:
block
: Keys and values are collected in groups and compressed together. Block compression is generally more compact, because the compression algorithm can take advantage of similarities among different values.
record
: Only the values in the sequence file are compressed.
All of these examples use the default codec and block compression:
%seq:compress("org.apache.hadoop.io.compress.DefaultCodec", "block") %seq:compress("DefaultCodec", "block") %seq:compress("default", "block")
Specifies the output file name prefix. The default prefix is part
.
A standard XQuery serialization parameter for the output method (text or XML) specified in %seq:put
. See "Serialization Annotations."
See Also:
SequenceFile at https://wiki.apache.org/hadoop/SequenceFile
"The Influence of Serialization Parameters" sections for XML and text output methods inXSLT and XQuery Serialization 3.1 at http://www.w3.org/TR/xslt-xquery-serialization-31/
The following examples query three XML files in HDFS with the following contents. Each XML file contains comments made by users on a specific day. Each comment can have zero or more "likes" from other users.
mydata/comments1.xml
<comments date="2013-12-30"> <comment id="12345" user="john" text="It is raining :( "/> <comment id="56789" user="kelly" text="I won the lottery!"> <like user="john"/> <like user="mike"/> </comment> </comments>
mydata/comments2.xml
<comments date="2013-12-31"> <comment id="54321" user="mike" text="Happy New Year!"> <like user="laura"/> </comment> </comments>
mydata/comments3.xml
<comments date="2014-01-01"> <comment id="87654" user="mike" text="I don't feel so good."/> <comment id="23456" user="john" text="What a beautiful day!"> <like user="kelly"/> <like user="phil"/> </comment> </comments>
Example 6-14
The following query stores the comment
elements in sequence files.
import module "oxh:seq"; import module "oxh:xmlf"; for $comment in xmlf:collection("mydata/comments*.xml", "comment") return seq:put-xml($comment)
Example 6-15
The next query reads the sequence files generated by the previous query, which are stored in an output directory named myoutput
. The query then writes the names of users who made multiple comments to a text file.
import module "oxh:seq"; import module "oxh:text"; for $comment in seq:collection-xml("myoutput/part*")/comment let $user := $comment/@user group by $user let $count := count($comment) where $count gt 1 return text:put($user || " " || $count)
The text file created by the previous query contains the following lines:
john 2 mike 2
See "XML File Adapter."
Example 6-16
The following query extracts comment
elements from XML files and stores them in compressed sequence files. Before storing each comment, it deletes the id
attribute and uses the value as the key in the sequence files.
import module "oxh:xmlf"; declare %seq:put("xml") %seq:compress("default", "block") %seq:file("comments") function local:myPut($key as xs:string, $value as node()) external;
for $comment in xmlf:collection("mydata/comments*.xml", "comment") let $id := $comment/@id let $newComment := copy $c := $comment modify delete node $c/@id return $c return local:myPut($id, $newComment)
Example 6-17
The next query reads the sequence files that the previous query created in an output directory named myoutput
. The query automatically decompresses the sequence files.
import module "oxh:text"; import module "oxh:seq"; for $comment in seq:collection-xml("myoutput/comments*")/comment let $id := $comment/seq:key() where $id eq "12345" return text:put-xml($comment)
The query creates a text file that contains the following line:
<comment id="12345" user="john" text="It is raining :( "/>
This adapter provides functions to create full-text indexes and load them into Apache Solr servers. These functions call the Solr org.apache.solr.hadoop.MapReduceIndexerTool
at run time to generate a full-text index on HDFS and optionally merge it into Solr servers. You can declare and use multiple custom put functions supplied by this adapter and the built-in put function within a single query. For example, you can load data into different Solr collections or into different Solr clusters.
This adapter is described in the following topics:
The first time that you use the Solr adapter, ensure that Solr is installed and configured on your Hadoop cluster as described in "Installing Oracle XQuery for Hadoop".
Your Oracle XQuery for Hadoop query must use the following configuration properties or the equivalent annotation:
oracle.hadoop.xquery.solr.loader.zk-host
oracle.hadoop.xquery.solr.loader.collection
If the index is loaded into a live set of Solr servers, then this configuration property or the equivalent annotation is also required:
oracle.hadoop.xquery.solr.loader.go-live
You can set the configuration properties using either the -D
or -conf
options in the hadoop
command when you run the query. See "Running Queries" and "Solr Adapter Configuration Properties."
This example sets OXH_SOLR_MR_HOME
and uses the hadoop -D
option in a query to set the configuration properties:
$ export OXH_SOLR_MR_HOME=/usr/lib/solr/contrib/mr $ hadoop jar $OXH_HOME/lib/oxh.jar -D oracle.hadoop.xquery.solr.loader.zk-host=/solr -D oracle.hadoop.xquery.solr.loader.collection=collection1 -D oracle.hadoop.xquery.solr.loader.go-live=true ./myquery.xq -output ./myoutput
To use the built-in functions in your query, you must import the Solr module as follows:
import module "oxh:solr";
The Solr module contains the following functions:
The solr
prefix is bound to the oxh:solr
namespace by default.
Writes a single document to the Solr index.
The document XML format is specified by Solr at
https://wiki.apache.org/solr/UpdateXmlMessages
Signature
declare %solr:put function solr:put($value as element(doc)) external;
Parameters
$value
: A single XML element named doc
, which contains one or more field
elements, as shown here:
<doc> <field name="field_name_1">field_value_1</field> . . . <field name="field_name_N">field_value_N</field> </doc>
Returns
A generated index that is written into the output_dir/solr-put directory, where output_dir is the query output directory.
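Outside a query, for example in a test harness, the same doc element can be generated with any XML library. The following sketch uses Python's standard library; the field names are taken from the examples later in this section and are otherwise arbitrary.

```python
import xml.etree.ElementTree as ET

def make_solr_doc(fields):
    """Build a Solr <doc> update element from a dict of field name -> value,
    matching the <doc><field name="...">...</field></doc> shape shown above."""
    doc = ET.Element("doc")
    for name, value in fields.items():
        f = ET.SubElement(doc, "field", name=name)
        f.text = str(value)
    return doc

print(ET.tostring(make_solr_doc({"id": "john", "name": "John Doe"})).decode())
# <doc><field name="id">john</field><field name="name">John Doe</field></doc>
```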
You can use the following annotations to define functions that generate full-text indexes and load them into Solr.
Signature
Custom functions for generating Solr indexes must have the following signature:
declare %solr:put [additional annotations] function local:myFunctionName($value as node()) external;
Annotations
Declares the solr put function. Required.
Name of the subdirectory under the query output directory where the index files will be written. Optional; the default value is the function local name.
Controls various aspects of index generation. You can specify multiple %solr-property
annotations.
These annotations correspond to the command-line options of org.apache.solr.hadoop.MapReduceIndexerTool. Each MapReduceIndexerTool option has an equivalent Oracle XQuery for Hadoop configuration property and a %solr-property annotation. Annotations take precedence over configuration properties. See "Solr Adapter Configuration Properties" for more information about supported configuration properties and the corresponding annotations.
See Also:
For more information about MapReduceIndexerTool command-line options, see the Cloudera Search User Guide.
Parameters
$value
: An element or a document node conforming to the Solr XML syntax. See "solr:put" for details.
Example 6-18 Using the Built-in solr:put Function
This example uses the following HDFS text file. The file contains user profile information such as user ID, full name, and age, separated by colons (:).
mydata/users.txt john:John Doe:45 kelly:Kelly Johnson:32 laura:Laura Smith: phil:Phil Johnson:27
The first query creates a full-text index searchable by name.
import module "oxh:text"; import module "oxh:solr"; for $line in text:collection("mydata/users.txt") let $split := fn:tokenize($line, ":") let $id := $split[1] let $name := $split[2] return solr:put( <doc> <field name="id">{ $id }</field> <field name="name">{ $name }</field> </doc> )
The second query accomplishes the same result, but uses a custom put function. It also defines all configuration parameters by using function annotations. Thus, setting configuration properties is not required when running this query.
import module "oxh:text"; declare %solr:put %solr-property:go-live %solr-property:zk-host("/solr") %solr-property:collection("collection1") function local:my-solr-put($doc as element(doc)) external; for $line in text:collection("mydata/users.txt") let $split := fn:tokenize($line, ":") let $id := $split[1] let $name := $split[2] return local:my-solr-put( <doc> <field name="id">{ $id }</field> <field name="name">{ $name }</field> </doc> )
The Solr adapter configuration properties correspond to the Solr MapReduceIndexerTool
options.
MapReduceIndexerTool
is a MapReduce batch job driver that creates Solr index shards from input files, and writes the indexes into HDFS. It also supports merging the output shards into live Solr servers, typically a SolrCloud.
You can specify these properties with the generic -conf
and -D
hadoop
command-line options in Oracle XQuery for Hadoop. Properties specified using this method apply to all Solr adapter put functions in your query. See "Running Queries" and especially "Generic Options" for more information about the hadoop
command-line options.
Alternatively, you can specify these properties as Solr adapter put function annotations with the %solr-property
prefix. These annotations are identified in the property descriptions. Annotations apply only to the particular Solr adapter put function that contains them in its declaration.
See Also:
For discussions about how Solr uses the MapReduceIndexerTool
options, see the Cloudera Search User Guide at
Property | Overview
---|---
oracle.hadoop.xquery.solr.loader.collection | Type: String. Default Value: Not defined. Equivalent Annotation: %solr-property:collection. Description: The SolrCloud collection for merging the index, such as collection1. Required.
oracle.hadoop.xquery.solr.loader.fair-scheduler-pool | Type: String. Default Value: Not defined. Equivalent Annotation: %solr-property:fair-scheduler-pool. Description: The name of the fair scheduler pool for submitting jobs. The job runs using fair scheduling instead of the default Hadoop scheduling method. Optional.
oracle.hadoop.xquery.solr.loader.go-live | Type: String values true or false. Equivalent Annotation: %solr-property:go-live. Description: Set to true to merge the generated index into a live set of Solr servers. Optional.
oracle.hadoop.xquery.solr.loader.go-live-threads | Type: Integer. Equivalent Annotation: %solr-property:go-live-threads. Description: The maximum number of live merges that can run in parallel. Optional.
oracle.hadoop.xquery.solr.loader.log4j | Type: String. Equivalent Annotation: %solr-property:log4j. Description: The relative or absolute path to the log4j configuration file. This file is uploaded for each MapReduce task.
oracle.hadoop.xquery.solr.loader.mappers | Type: String. Equivalent Annotation: %solr-property:mappers. Description: The maximum number of mapper tasks that Solr uses.
oracle.hadoop.xquery.solr.loader.max-segments | Type: String. Equivalent Annotation: %solr-property:max-segments. Description: The maximum number of segments in the index generated by each reducer.
oracle.hadoop.xquery.solr.loader.reducers | Type: String. Equivalent Annotation: %solr-property:reducers. Description: The number of reducers to use.
oracle.hadoop.xquery.solr.loader.zk-host | Type: String. Default Value: Not defined. Equivalent Annotation: %solr-property:zk-host. Description: The address of a ZooKeeper ensemble used by the SolrCloud cluster. Specify the address as a list of comma-separated host:port pairs, each corresponding to a ZooKeeper server; an address that starts with a slash (/), such as /solr, is a ZooKeeper chroot path. This property enables Solr to determine the number of output shards to create and the Solr URLs in which to merge them. Use this property with oracle.hadoop.xquery.solr.loader.collection.
The text file adapter provides functions to read and write text files stored in HDFS. It is described in the following topics:
To use the built-in functions in your query, you must import the text file module as follows:
import module "oxh:text";
The text file module contains the following functions:
For examples, see "Examples of Text File Adapter Functions."
Accesses a collection of text files in HDFS. The files can be compressed using a Hadoop-supported compression codec. They are automatically decompressed when read.
The files might be split up and processed in parallel by multiple tasks.
Signature
declare %text:collection("text") function text:collection($uris as xs:string*) as xs:string* external; declare %text:collection("text") function text:collection($uris as xs:string*, $delimiter as xs:string?) as xs:string* external;
Parameters
$uris
: The text file URIs.
$delimiter
: A custom delimiter on which the file is split. The default is the newline character.
Returns
One string value for each file segment identified by the delimiter; for the default delimiter, a string value for each line in each file
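The delimiter-based splitting can be approximated as follows. This is a minimal sketch for illustration, not the adapter's actual implementation (which also handles compression and parallel splits).

```python
def split_collection(text, delimiter="\n"):
    """Return one string value per file segment, mirroring how
    text:collection splits file contents on the delimiter
    (a newline character by default)."""
    return text.split(delimiter)

content = "line1\nline2\nline3"
print(split_collection(content))       # ['line1', 'line2', 'line3']
print(split_collection("a;b;c", ";"))  # ['a', 'b', 'c']
```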
Accesses a collection of text files in HDFS. The files can be compressed using a Hadoop-supported compression codec. They are automatically decompressed when read.
The files might be split up and processed in parallel by multiple tasks. Each delimited section of each file is parsed as an XML document and returned by the function. Therefore, each segment must fully contain a single XML document, and any delimiter characters in the XML must be escaped with XML character references. By default, the delimiter is a newline character.
Signature
declare %text:collection("xml") function text:collection-xml($uris as xs:string*) as document-node()* external; declare %text:collection("xml") function text:collection-xml($uris as xs:string*, $delimiter as xs:string?) as document-node()* external;
Parameters
$uris
: The text file URIs.
$delimiter
: A custom delimiter on which the file is split. The default is the newline character.
Returns
One XML document for each file segment identified by the delimiter; for the default delimiter, an XML document for each line in each file
Writes a line to a text file in the output directory of the query. The lines are spread across one or more files.
Signature
declare %text:put("text") function text:put($value as xs:string) external;
Parameters
$value
: The text to write
Returns
empty-sequence()
Notes
The number of files created depends on how the query is distributed among tasks. Each file has a name that starts with part
, such as part-m-00000
. You specify the output directory when the query executes. See "Running Queries."
Writes XML to a line in a text file. The lines are spread across one or more files in the output directory of the query.
Newline characters in the serialized XML are replaced with character references to ensure that the XML does not span multiple lines. For example, &#xA; replaces the linefeed character (\n).
Signature
declare %text:put("xml") function text:put-xml($value as node()) external;
Parameters
$value
: The XML to write
Returns
empty-sequence()
Notes
The number of files created depends on how the query is distributed among tasks. Each file has a name that starts with part
, such as part-m-00000
. You specify the output directory when the query executes. See "Running Queries."
Writes a line to a text file named trace-*
in the output directory of the query. The lines are spread across one or more files.
This function provides you with a quick way to write to an alternate output. For example, you might create a trace file to identify invalid rows within a query, while loading the data into an Oracle database table.
Signature
declare %text:put("text") %text:file("trace") function text:trace($value as xs:string) external;
Parameters
$value
: The text to write
Returns
empty-sequence()
You can use the following annotations to define functions that read collections of text files in HDFS. These annotations provide additional functionality that is not available using the built-in functions.
The input files can be compressed with a Hadoop-supported compression codec. They are automatically decompressed when read.
Signature
Custom functions for reading text files must have one of the following signatures:
declare %text:collection("text") [additional annotations] function local:myFunctionName($uris as xs:string*, $delimiter as xs:string?) as xs:string* external; declare %text:collection("text") [additional annotations] function local:myFunctionName($uris as xs:string*) as xs:string* external; declare %text:collection("xml") [additional annotations] function local:myFunctionName($uris as xs:string*, $delimiter as xs:string?) as document-node()* external; declare %text:collection("xml") [additional annotations] function local:myFunctionName($uris as xs:string*) as document-node()* external;
Annotations
Declares the text
collection function. Required.
The optional method parameter can be one of the following values:
text
: Each line in the text file is returned as xs:string
. Default.
xml
: Each line in the text file is parsed as XML and returned as document-node
. Each XML document must be fully contained on a single line. Newline characters inside the document must be represented by a numeric character reference.
Specifies a custom delimiter for splitting the input files. The default delimiter is the newline character.
Do not combine this annotation with the $delimiter
parameter. To specify a custom delimiter, use either this annotation or the $delimiter
parameter.
Specifies the maximum split size as either an integer or a string value. The split size controls how the input file is divided into tasks. Hadoop calculates the split size as max($split-min, min($split-max, $block-size))
. Optional.
In a string value, you can append K
, k
, M
, m
, G
, or g
to the value to indicate kilobytes, megabytes, or gigabytes instead of bytes (the default unit). These qualifiers are not case sensitive. The following examples are equivalent:
%text:split-max(1024) %text:split-max("1024") %text:split-max("1K")
Specifies the minimum split size as either an integer or a string value. The split size controls how the input file is divided into tasks. Hadoop calculates the split size as max($split-min, min($split-max, $block-size))
. Optional.
In a string value, you can append K
, k
, M
, m
, G
, or g
to the value to indicate kilobytes, megabytes, or gigabytes instead of bytes (the default unit). These qualifiers are not case sensitive. The following examples are equivalent:
%text:split-min(1024) %text:split-min("1024") %text:split-min("1K")
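The split-size formula and the size suffixes described above can be sketched as follows. The helper names are hypothetical; only the formula max($split-min, min($split-max, $block-size)) and the K/M/G qualifiers come from the text.

```python
def parse_size(value):
    """Parse an integer, or a string such as "1K", "32m", or "1g",
    into a byte count. Qualifiers are case insensitive."""
    if isinstance(value, int):
        return value
    units = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}
    last = value[-1].lower()
    if last in units:
        return int(value[:-1]) * units[last]
    return int(value)

def split_size(split_min, split_max, block_size):
    """Effective split size: max($split-min, min($split-max, $block-size))."""
    return max(parse_size(split_min),
               min(parse_size(split_max), parse_size(block_size)))

print(parse_size("1K"))              # 1024
print(split_size(1, "32m", "128m"))  # 33554432
```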
Parameters
$uris
: Lists the HDFS file URIs. The files can be uncompressed or compressed with a Hadoop-supported codec. Required.
$delimiter
: A custom delimiter on which the input text files are split. The default delimiter is a newline character. Do not combine this parameter with the %text:split annotation.
Returns
xs:string*
for the text
method
document-node()*
for the xml
method
You can use the following annotations to define functions that write text files in HDFS.
Signature
Custom functions for writing text files must have one of the following signatures:
declare %text:put("text") [additional annotations] function local:myFunctionName($value as xs:string) external; declare %text:put("xml") [additional annotations] function local:myFunctionName($value as node()) external;
Annotations
Declares the text
put function. Required.
The optional method parameter can be one of the following values:
text
: Writes data to a text file. Default.
xml
: Writes data to an XML file. The XML is serialized and newline characters are replaced with character references. This process ensures that the resulting XML document is one text line with no line breaks.
Specifies the compression format used on the output. The default is no compression. Optional.
The codec parameter identifies a compression codec. The first registered compression codec that matches the value is used. The value matches a codec if it equals one of the following:
The fully qualified class name of the codec
The unqualified class name of the codec
The prefix of the unqualified class name before "Codec" (case insensitive)
All of these examples use the default codec and block compression:
%text:compress("org.apache.hadoop.io.compress.DefaultCodec", "block") %text:compress("DefaultCodec", "block") %text:compress("default", "block")
Specifies the output file name prefix. The default prefix is part
.
A standard XQuery serialization parameter for the output method (text or XML) specified in %text:put
. See "Serialization Annotations."
UTF-8 is currently the only supported character encoding.
Example 6-19 Using Built-in Functions to Query Text Files
This example uses the following text files in HDFS. The files contain a log of visits to different web pages. Each line represents a visit to a web page and contains the time, user name, page visited, and HTTP status code.
mydata/visits1.log 2013-10-28T06:00:00, john, index.html, 200 2013-10-28T08:30:02, kelly, index.html, 200 2013-10-28T08:32:50, kelly, about.html, 200 2013-10-30T10:00:10, mike, index.html, 401 mydata/visits2.log 2013-10-30T10:00:01, john, index.html, 200 2013-10-30T10:05:20, john, about.html, 200 2013-11-01T08:00:08, laura, index.html, 200 2013-11-04T06:12:51, kelly, index.html, 200 2013-11-04T06:12:40, kelly, contact.html, 200
The following query filters out the pages visited by john
and writes only the date and page visited to a new text file:
import module "oxh:text"; for $line in text:collection("mydata/visits*.log") let $split := fn:tokenize($line, "\s*,\s*") where $split[2] eq "john" return text:put($split[1] || " " || $split[3])
This query creates a text file that contains the following lines:
2013-10-28T06:00:00 index.html 2013-10-30T10:00:01 index.html 2013-10-30T10:05:20 about.html
The next query computes the number of page visits per day:
import module "oxh:text"; for $line in text:collection("mydata/visits*.log") let $split := fn:tokenize($line, "\s*,\s*") let $time := xs:dateTime($split[1]) let $day := xs:date($time) group by $day return text:put($day || " => " || count($line))
The query creates text files that contain the following lines:
2013-10-28 => 3 2013-10-30 => 3 2013-11-01 => 1 2013-11-04 => 2
Example 6-20 Querying Simple Delimited Formats
This example uses the fn:tokenize
function to parse the lines of a text file. This technique works well for simple delimited formats.
The following query declares custom put and collection functions. It computes the number of hits and the number of unique users for each page in the logs.
import module "oxh:text"; declare %text:collection("text") %text:split-max("32m") function local:col($uris as xs:string*) as xs:string* external; declare %text:put("xml") %text:compress("gzip") %text:file("pages") function local:out($arg as node()) external; for $line in local:col("mydata/visits*.log") let $split := fn:tokenize($line, "\s*,\s*") let $user := $split[2] let $page := $split[3] group by $page return local:out( <page> <name>{$page}</name> <hits>{count($line)}</hits> <users>{fn:count(fn:distinct-values($user))}</users> </page> )
The output directory of the previous query is named myoutput
. The following lines are written to myoutput/pages-r-*.gz
.
<page><name>about.html</name><hits>2</hits><users>2</users></page> <page><name>contact.html</name><hits>1</hits><users>1</users></page> <page><name>index.html</name><hits>6</hits><users>4</users></page>
The files are compressed with the gzip
codec. The following query reads the output files, and writes the page name and total hits as plain text. The collection function automatically decompresses the files.
import module "oxh:text"; for $page in text:collection-xml("myoutput/page*.gz")/page return text:put($page/name || "," || $page/hits)
This query creates text files that contain the following lines:
about.html,2 contact.html,1 index.html,6
Example 6-21 Querying Complex Text Formats
The fn:tokenize
function might not be adequate for complex formats that contain a variety of data types and delimiters. This example uses the fn:analyze-string
function to process a log file in the Apache Common Log format.
A text file named mydata/access.log in HDFS contains the following lines:
192.0.2.0 - - [30/Sep/2013:16:39:38 +0000] "GET /inddex.html HTTP/1.1" 404 284 192.0.2.0 - - [30/Sep/2013:16:40:54 +0000] "GET /index.html HTTP/1.1" 200 12390 192.0.2.4 - - [01/Oct/2013:12:10:54 +0000] "GET /index.html HTTP/1.1" 200 12390 192.0.2.4 - - [01/Oct/2013:12:12:12 +0000] "GET /about.html HTTP/1.1" 200 4567 192.0.2.1 - - [02/Oct/2013:08:39:38 +0000] "GET /indexx.html HTTP/1.1" 404 284 192.0.2.1 - - [02/Oct/2013:08:40:54 +0000] "GET /index.html HTTP/1.1" 200 12390 192.0.2.1 - - [02/Oct/2013:08:42:38 +0000] "GET /aobut.html HTTP/1.1" 404 283
The following query computes the requests made after September 2013 when the server returned a status code 404 (Not Found) error. It uses a regular expression and fn:analyze-string
to match the components of the log entries. The time format cannot be cast directly to xs:dateTime
, as it was in Example 6-19. Instead, the ora-fn:dateTime-from-string-with-format
function converts the string to an instance of xs:dateTime
.
import module "oxh:text"; declare variable $REGEX := '(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]+)" (\S+) (\S+)'; for $line in text:collection("mydata/access.log") let $match := fn:analyze-string($line, $REGEX)/fn:match let $time := ora-fn:dateTime-from-string-with-format( "dd/MMM/yyyy:HH:mm:ss Z", $match/fn:group[4] ) let $status := $match/fn:group[6] where $status eq "404" and $time ge xs:dateTime("2013-10-01T00:00:00") let $host := $match/fn:group[1] let $request := $match/fn:group[5] return text:put($host || "," || $request)
The query creates text files that contain the following lines:
192.0.2.1,GET /indexx.html HTTP/1.1 192.0.2.1,GET /aobut.html HTTP/1.1
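The same regular expression works outside XQuery. The following is a rough Python equivalent of the query's matching and filtering step, given as an illustration; the date parsing is adapted to Python's strptime format codes rather than the ora-fn format pattern.

```python
import re
from datetime import datetime

# Same pattern as the $REGEX variable in the query above.
REGEX = r'(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]+)" (\S+) (\S+)'

line = '192.0.2.1 - - [02/Oct/2013:08:42:38 +0000] "GET /aobut.html HTTP/1.1" 404 283'
m = re.match(REGEX, line)
host, time_str, request, status = m.group(1), m.group(4), m.group(5), m.group(6)

# Parse the Common Log timestamp, then apply the same 404 / date filter.
time = datetime.strptime(time_str, "%d/%b/%Y:%H:%M:%S %z")
cutoff = datetime.strptime("2013-10-01T00:00:00 +0000", "%Y-%m-%dT%H:%M:%S %z")
if status == "404" and time >= cutoff:
    print(host + "," + request)  # 192.0.2.1,GET /aobut.html HTTP/1.1
```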
See Also:
XPath and XQuery Functions and Operators 3.0 specification for information about the fn:tokenize
and fn:analyze-string
functions:
fn:tokenize at http://www.w3.org/TR/xpath-functions-31/#func-tokenize
fn:analyze-string at http://www.w3.org/TR/xpath-functions-31/#func-analyze-string
For information about the Apache Common log format:
This adapter provides functions to parse files stored in HDFS in various formats using the Apache Tika library. It is described in the following topics:
To use the built-in functions in your query, you must import the Tika file module as follows:
import module "oxh:tika";
The Tika file module contains the following functions:
For examples, see "Examples of Tika File Adapter Functions."
Parses files stored in HDFS in various formats and extracts the content or metadata from them.
Signature
declare %tika:collection function tika:collection($uris as xs:string*) as document-node()* external; declare %tika:collection function tika:collection($uris as xs:string*, $contentType as xs:string?) as document-node()* external;
Parameters
$uris
: The HDFS file URIs.
$contentType
: Specifies the media type of the content to parse, and may include the charset attribute. When this parameter is specified, it defines both the content type and the encoding. When it is not specified, Tika attempts to automatically detect the values from the file extension. Oracle recommends that you specify this parameter.
Returns
Returns a document node for each value. See "Tika Parser Output Format".
Parses the data given to it as an argument. For example, it can parse an HTML fragment within an XML or JSON document.
Signature
declare function tika:parse($data as xs:string?, $contentType as xs:string?) as document-node()* external;
Parameters
$data
: The value to be parsed.
$contentType
: Specifies the media type of the content to parse, and may include the charset attribute. When this parameter is specified, it defines both the content type and the encoding. When it is not specified, Tika attempts to automatically detect the values from the file extension. Oracle recommends that you specify this parameter.
Returns
Returns a document node for each value. See "Tika Parser Output Format".
You can use the following annotations to define functions to parse files in HDFS with Tika. These annotations provide additional functionality that is not available using the built-in functions.
Signature
Custom functions for reading HDFS files must have one of the following signatures:
declare %tika:collection [additional annotations] function local:myFunctionName($uris as xs:string*, $contentType as xs:string?) as document-node()* external; declare %tika:collection [additional annotations] function local:myFunctionName($uris as xs:string*) as document-node()* external;
Annotations
Identifies an external function to be implemented by the Tika file adapter. Required.
The optional method parameter can be one of the following values:
tika
: Each line in the tika file is returned as document-node()
. Default.
Declares the file content type. It is a MIME type and must not have the charset attribute as per XQuery specifications. Optional.
Declares the file character set. Optional.
Note:
The %output:media-type and %output:encoding annotations specify the content type or encoding when the $contentType parameter is not explicitly provided in the signature.
Parameters
Returns
document-node()*
with two root elements. See "Tika Parser Output Format".
The result of Tika parsing is a document node with two root elements:
Root element #1 is the XHTML content produced by Tika.
Root element #2 is the document metadata extracted by Tika.
The root elements have the following formats:
Root element #1
<html xmlns="http://www.w3.org/1999/xhtml"> ...textual content of Tika HTML... </html>
Root element #2
<tika:metadata xmlns:tika="oxh:tika"> <tika:property name="Name_1">VALUE_1</tika:property> <tika:property name="NAME_2">VALUE_2</tika:property> </tika:metadata>
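Root element #2 can be consumed with any XML parser. The following sketch uses Python's standard library to collect the metadata properties into a dictionary; the property names shown are illustrative, since the actual names depend on the parsed file.

```python
import xml.etree.ElementTree as ET

# Example metadata in the tika:metadata format shown above
# (property names are illustrative).
metadata_xml = """<tika:metadata xmlns:tika="oxh:tika">
  <tika:property name="Content-Type">application/pdf</tika:property>
  <tika:property name="resourceName">report.pdf</tika:property>
</tika:metadata>"""

root = ET.fromstring(metadata_xml)
# ElementTree expands the tika prefix to the oxh:tika namespace URI.
props = {p.get("name"): p.text for p in root.findall("{oxh:tika}property")}
print(props)  # {'Content-Type': 'application/pdf', 'resourceName': 'report.pdf'}
```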
The following Hadoop properties control the behavior of Tika adapter:
Type: Boolean
Default Value: false
Description: When this is set to TRUE, then all the HTML elements are omitted during parsing. When this is set to FALSE, then only the safe elements are omitted during parsing.
Type: Comma-separated list of strings
Default Value: Not defined
Description: Defines the locale to be used by some Tika parsers, such as the Microsoft Office document parser. Only three strings are allowed: language, country, and variant. The strings country and variant are optional. When the locale is not defined, the system locale is used. When the strings are defined, they must correspond to the java.util.Locale
specification format described in http://docs.oracle.com/javase/7/docs/api/java/util/Locale.html
and the locale can be constructed as follows:
If only language is specified, then the locale is constructed from the language.
If the language and country are specified, then the locale is constructed from both language and country.
If language, country, and variant are specified, then the locale is constructed from language, country, and variant.
Example 6-22 Using Built-in Functions to Index PDF documents with Cloudera Search
This example query uses Tika to parse PDF files into HTML form, and then adds the HTML documents to the Solr full-text index. The input is the set of HDFS files matching *bigdata*.pdf.
The following query indexes the HDFS files:
import module "oxh:tika"; import module "oxh:solr"; for $doc in tika:collection("*bigdata*.pdf") let $docid := data($doc//*:meta[@name eq "resourceName"]/@content)[1] let $body := $doc//*:body[1] return solr:put( <doc> <field name="id">{ $docid }</field> <field name="text">{ string($body) }</field> <field name="content">{ serialize($doc/*:html) }</field> </doc> )
The HTML representation of the documents is added to the Solr index, making them searchable. Each document ID in the index is the file name.
Example 6-23 Using Built-in Functions to Index HTML documents with Cloudera Search
This example uses Tika to parse HTML documents stored in sequence files, where each key is a URL and each value is an HTML document.
import module "oxh:tika"; import module "oxh:solr"; import module "oxh:seq"; for $doc in seq:collection-tika("/path/to/seq/files/*") let $docid := document-uri($doc) let $body := $doc//*:body[1] return solr:put( <doc> <field name="id">{ $docid }</field> <field name="text">{ string($body) }</field> <field name="content">{ serialize($doc/*:html) }</field> </doc> )
The HTML representation of the documents is added to the Solr index, making them searchable. Each document ID in the index is the document URI, which is the URL stored in the sequence file.
The XML file adapter provides access to XML files stored in HDFS. The adapter optionally splits individual XML files so that a single file can be processed in parallel by multiple tasks.
This adapter is described in the following topics:
To use the built-in functions in your query, you must import the XML file module as follows:
import module "oxh:xmlf";
The XML file module contains the following functions:
Accesses a collection of XML documents in HDFS. Multiple files can be processed concurrently, but each individual file is parsed by a single task.
This function automatically decompresses files compressed with a Hadoop-supported codec.
Note:
HDFS does not perform well when data is stored in many small files. For large data sets with many small XML documents, use Hadoop sequence files and the Sequence File Adapter.
Signature
declare %xmlf:collection function xmlf:collection($uris as xs:string*) as document-node()* external;
Parameters
$uris
: The XML file URIs
Returns
One XML document for each file
Accesses a collection of XML documents in HDFS. Multiple files can be processed concurrently, but each individual file is parsed by a single task. This function is the same as xmlf:collection except that each file may contain multiple well-formed XML documents concatenated together.
This function automatically decompresses files compressed with a Hadoop-supported codec. For example, a file containing multiple XML documents could be compressed using GZIP and then accessed directly by this function.
Signature
declare %xmlf:collection("multipart") function xmlf:collection-multipart($uris as xs:string*) as document-node()* external;
Returns
One or more XML documents for each file.
Accesses a collection of XML documents in HDFS. The files might be split and processed by multiple tasks simultaneously, which enables very large XML files to be processed efficiently. The function returns only elements that match a specified name.
This function does not automatically decompress files. It only supports XML files that meet certain requirements. See "Restrictions on Splitting XML Files."
Signature
declare %xmlf:collection function xmlf:collection($uris as xs:string*, $names as xs:anyAtomicType+) as element()* external;
Parameters

$uris
: The XML file URIs

$names
: The names of the elements to return

Returns

Each element that matches one of the names specified by the $names argument
You can use the following annotations to define functions that read collections of XML files in HDFS. These annotations provide additional functionality that is not available using the built-in functions.
Signature
Custom functions for reading XML files must have one of the following signatures:
declare %xmlf:collection(["xml"|"multipart"]) [additional annotations] function local:myFunctionName($uris as xs:string*) as node()* external;

declare %xmlf:collection("xml") [additional annotations] function local:myFunctionName($uris as xs:string*, $names as xs:anyAtomicType+) as element()* external;
Annotations
Declares the collection function. Required.
The method parameter is one of the following values:
xml: Each value is parsed as XML
multipart: Each value (that is, each file) may contain a concatenation of multiple well-formed XML documents. This method cannot be used with parallel XML parsing. (See xmlf:split and the two-argument function signature.)
Specifies the element names used for parallel XML parsing. You can use this annotation instead of the $names argument.
When this annotation is specified, only the single-argument version of the function is allowed. This restriction enables the element names to be specified statically, so they do not need to be specified when the function is called.
Identifies the text encoding of the input documents.
When this annotation is used with the %xmlf:split annotation or the $names argument, only ISO-8859-1, US-ASCII, and UTF-8 are valid encodings. Otherwise, the valid encodings are those supported by the JVM. UTF-8 is assumed when this annotation is omitted.
See Also:
"Supported Encodings" in the Oracle Java SE documentation at
http://docs.oracle.com/javase/7/docs/technotes/guides/intl/encoding.doc.html
This annotation provides extra namespace declarations to the parser. You can specify it multiple times to declare one or more namespaces.
Use this annotation to declare the namespaces of ancestor elements. When XML is processed in parallel, only elements that match the specified names are processed by an XML parser. If a matching element depends on the namespace declaration of one of its ancestor elements, then the declaration is not visible to the parser and an error may occur.
These namespace declarations can also be used in element names when specifying the split names. For example:
declare
   %xmlf:collection
   %xmlf:split("eg:foo")
   %xmlf:split-namespace("eg", "http://example.org")
function local:myFunction($uris as xs:string*) as element()* external;
Provides entity definitions to the XML parser. When XML is processed in parallel, only elements that match the specified split names are processed by an XML parser. The DTD of an input document that is split and processed in parallel is not processed.
In this example, the XML parser expands &foo; entity references as "Hello World":
%xmlf:split-entity("foo","Hello World")
Specifies the maximum split size as either an integer or a string value. The split size controls how the input file is divided into tasks. Hadoop calculates the split size as max($split-min, min($split-max, $block-size)). Optional.

In a string value, you can append K, k, M, m, G, or g to the value to indicate kilobytes, megabytes, or gigabytes instead of bytes (the default unit). These qualifiers are not case sensitive. The following examples are equivalent:
%xmlf:split-max(1024) %xmlf:split-max("1024") %xmlf:split-max("1K")
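The split-size calculation and the size qualifiers can be sketched in Python (an illustrative sketch only; parse_size and split_size are hypothetical helper names, not part of Oracle XQuery for Hadoop):

```python
import re

def parse_size(value):
    """Parse a split-size value: an integer, or a string with an optional
    K/M/G qualifier (case-insensitive). Bytes are the default unit."""
    if isinstance(value, int):
        return value
    m = re.fullmatch(r"(\d+)([KkMmGg]?)", value.strip())
    if m is None:
        raise ValueError("invalid size: %r" % value)
    factor = {"": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}
    return int(m.group(1)) * factor[m.group(2).lower()]

def split_size(split_min, split_max, block_size):
    """Hadoop's split-size formula: max($split-min, min($split-max, $block-size))."""
    return max(split_min, min(split_max, block_size))
```

For example, with a 128 MB HDFS block size and %xmlf:split-max("32M"), the formula yields 32 MB splits, so each block is divided among four tasks.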
Specifies the minimum split size as either an integer or a string value. The split size controls how the input file is divided into tasks. Hadoop calculates the split size as max($split-min, min($split-max, $block-size)). Optional.

In a string value, you can append K, k, M, m, G, or g to the value to indicate kilobytes, megabytes, or gigabytes instead of bytes (the default unit). These qualifiers are not case sensitive. The following examples are equivalent:
%xmlf:split-min(1024) %xmlf:split-min("1024") %xmlf:split-min("1K")
Notes
Restrictions on Splitting XML Files
Individual XML documents can be processed in parallel when the element names are specified using either the $names argument or the %xmlf:split annotation.
The input documents must meet the following constraints to be processed in parallel:
The XML cannot contain a comment, CDATA section, or processing instruction that contains text that matches one of the specified element names (that is, a < character followed by a name that expands to a QName). Otherwise, such content might be parsed incorrectly as an element.
An element in the file that matches a specified element name cannot contain a descendant element that also matches a specified name. Otherwise, multiple processors might pick up the matching descendant and cause the function to produce incorrect results.
An element that matches one of the specified element names (and all of its descendants) must not depend on the namespace declarations of any of its ancestors. Because the ancestors of a matching element are not parsed, the namespace declarations in these elements are not processed.
You can work around this limitation by manually specifying the namespace declarations with the %xmlf:split-namespace annotation.
Oracle recommends that the specified element names do not match elements in the file that are bigger than the split size. If they do, then the adapter functions correctly but not efficiently.
Processing XML in parallel is difficult, because parsing cannot begin in the middle of an XML file. XML constructs like CDATA sections, comments, and namespace declarations impose this limitation. A parser starting in the middle of an XML document cannot assume that, for example, the string <foo> is a begin element tag, without searching backward to the beginning of the document to ensure that it is not inside a CDATA section or a comment. However, large XML documents typically contain sequences of similarly structured elements and thus are amenable to parallel processing. If you specify the element names, then each task works by scanning a portion of the document for elements that match one of the specified names. Only elements that match a specified name are given to a true XML parser. Thus, the parallel processor does not perform a true parse of the entire document.
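The scanning strategy described above can be sketched as follows (a conceptual illustration only, not the adapter's actual implementation; it assumes the documented restrictions hold, namely no nested matching elements and no matching names inside comments or CDATA sections):

```python
import re
import xml.etree.ElementTree as ET

def scan_for_elements(chunk, name):
    """Scan a portion of a document for elements whose tag matches `name`,
    and hand only those spans to a real XML parser."""
    results = []
    # A start tag is "<name" followed by whitespace, "/", or ">".
    start_tag = re.compile(r"<%s(?=[\s/>])" % re.escape(name))
    pos = 0
    while True:
        m = start_tag.search(chunk, pos)
        if m is None:
            break
        gt = chunk.index(">", m.start())
        if chunk[gt - 1] == "/":                      # self-closing element
            end = gt + 1
        else:                                         # find the matching end tag
            end = chunk.index("</%s>" % name, gt) + len(name) + 3
        results.append(ET.fromstring(chunk[m.start():end]))
        pos = end
    return results
```

Only the matched spans ever reach the parser; the surrounding document is never parsed, which is why ancestor namespace declarations must be supplied separately with %xmlf:split-namespace.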
Example 6-24 Using Built-in Functions to Query XML Files
This example queries three XML files in HDFS with the following contents. Each XML file contains comments made by users on a specific day. Each comment can have zero or more "likes" from other users.
mydata/comments1.xml

<comments date="2013-12-30">
   <comment id="12345" user="john" text="It is raining :( "/>
   <comment id="56789" user="kelly" text="I won the lottery!">
      <like user="john"/>
      <like user="mike"/>
   </comment>
</comments>

mydata/comments2.xml

<comments date="2013-12-31">
   <comment id="54321" user="mike" text="Happy New Year!">
      <like user="laura"/>
   </comment>
</comments>

mydata/comments3.xml

<comments date="2014-01-01">
   <comment id="87654" user="mike" text="I don't feel so good."/>
   <comment id="23456" user="john" text="What a beautiful day!">
      <like user="kelly"/>
      <like user="phil"/>
   </comment>
</comments>
This query writes the number of comments made each year to a text file. No element names are passed to xmlf:collection, so it returns three documents, one for each file. Each file is processed serially by a single task.
import module "oxh:xmlf";
import module "oxh:text";

for $comments in xmlf:collection("mydata/comments*.xml")/comments
let $date := xs:date($comments/@date)
group by $year := fn:year-from-date($date)
return text:put($year || ", " || fn:count($comments/comment))
The query creates text files that contain the following lines:
2013, 3
2014, 2
The next query writes the number of comments and the average number of likes for each user. Each input file is split, so that it can be processed in parallel by multiple tasks. The xmlf:collection function returns five elements, one for each comment.
import module "oxh:xmlf";
import module "oxh:text";

for $comment in xmlf:collection("mydata/comments*.xml", "comment")
let $likeCt := fn:count($comment/like)
group by $user := $comment/@user
return text:put($user || ", " || fn:count($comment) || ", " || fn:avg($likeCt))
This query creates text files that contain the following lines:
john, 2, 1
kelly, 1, 2
mike, 2, 0.5
Example 6-25 Writing a Custom Function to Query XML Files
The following example declares a custom function to access XML files:
import module "oxh:text";

declare
   %xmlf:collection
   %xmlf:split("comment")
   %xmlf:split-max("32M")
function local:comments($uris as xs:string*) as element()* external;

for $c in local:comments("mydata/comment*.xml")
where $c/@user eq "mike"
return text:put($c/@id)
The query creates a text file that contains the following lines:
54321
87654
Example 6-26 Accessing Compressed, Multipart XML Files
Assume that files comments1.xml, comments2.xml, and comments3.xml from Example 6-24 are concatenated together and compressed using GZIP to create a single file named comments.xml.gz. For example:
cat comments1.xml comments2.xml comments3.xml | gzip > comments.xml.gz
The following query accesses this multipart, compressed XML file:
import module "oxh:text";
import module "oxh:xmlf";

for $comment in xmlf:collection-multipart("comments.xml.gz")/comments/comment
return text:put($comment/@id || "," || $comment/@user)
The query creates a text file that contains the following lines:
12345,john
56789,kelly
54321,mike
87654,mike
23456,john
The utility module contains ora-fn functions for handling strings and dates. These functions are defined in XDK XQuery, whereas the oxh functions are specific to Oracle XQuery for Hadoop.
The utility functions are described in the following topics:
You can manipulate durations, dates, and times in XQuery using Oracle XQuery functions.
The Oracle XQuery functions are in namespace http://xmlns.oracle.com/xdk/xquery/function. The namespace prefix ora-fn is predeclared, and the module is automatically imported.
Returns a new date value from a string according to the specified pattern.
Signature
ora-fn:date-from-string-with-format($format as xs:string?, $dateString as xs:string?, $locale as xs:string?) as xs:date?

ora-fn:date-from-string-with-format($format as xs:string?, $dateString as xs:string?) as xs:date?
Parameters
$format
: The pattern; see "Format Argument"
$dateString
: An input string that represents a date
$locale
: A one- to three-field value that represents the locale; see "Locale Argument"
Example
This example returns the specified date in the current time zone:
ora-fn:date-from-string-with-format("yyyy-MM-dd G", "2013-06-22 AD")
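The pattern letters follow Java's SimpleDateFormat conventions (see "Format Argument"). As a rough illustration of how such a pattern is interpreted, the sketch below maps a few letters onto Python strptime directives; parse_with_format and the mapping table are hypothetical and deliberately partial (era (G), time zones, and locale handling are omitted):

```python
from datetime import datetime

# Partial mapping of SimpleDateFormat-style pattern letters to strptime
# directives. Order matters: "MM" (month) must be replaced before "mm" (minute).
SDF_TO_STRPTIME = [("yyyy", "%Y"), ("MM", "%m"), ("dd", "%d"),
                   ("HH", "%H"), ("hh", "%I"), ("mm", "%M"),
                   ("ss", "%S"), ("a", "%p")]

def parse_with_format(fmt, value):
    """Translate a SimpleDateFormat-style pattern and parse `value` with it."""
    for sdf, directive in SDF_TO_STRPTIME:
        fmt = fmt.replace(sdf, directive)
    return datetime.strptime(value, fmt)
```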
Returns a date string with the specified pattern.
Signature
ora-fn:date-to-string-with-format($format as xs:string?, $date as xs:date?, $locale as xs:string?) as xs:string?

ora-fn:date-to-string-with-format($format as xs:string?, $date as xs:date?) as xs:string?
Parameters
$format
: The pattern; see Format Argument
$date
: The date
$locale
: A one- to three-field value that represents the locale; see Locale Argument
Example
This example returns the string 2013-07-15:
ora-fn:date-to-string-with-format("yyyy-MM-dd", xs:date("2013-07-15"))
Returns a new date-time value from an input string according to the specified pattern.
Signature
ora-fn:dateTime-from-string-with-format($format as xs:string?, $dateTimeString as xs:string?, $locale as xs:string?) as xs:dateTime?

ora-fn:dateTime-from-string-with-format($format as xs:string?, $dateTimeString as xs:string?) as xs:dateTime?
Parameters
$format
: The pattern; see Format Argument
$dateTimeString
: The date and time
$locale
: A one- to three-field value that represents the locale; see "Locale Argument"
Examples
This example returns the specified date and 11:04:00 AM in the current time zone:
ora-fn:dateTime-from-string-with-format("yyyy-MM-dd 'at' hh:mm", "2013-06-22 at 11:04")
The next example returns the specified date and 12:00:00 AM in the current time zone:
ora-fn:dateTime-from-string-with-format("yyyy-MM-dd G", "2013-06-22 AD")
Returns a date and time string with the specified pattern.
Signature
ora-fn:dateTime-to-string-with-format($format as xs:string?, $dateTime as xs:dateTime?, $locale as xs:string?) as xs:string?

ora-fn:dateTime-to-string-with-format($format as xs:string?, $dateTime as xs:dateTime?) as xs:string?
Parameters
$format
: The pattern; see "Format Argument"
$dateTime
: The date and time
$locale
: A one- to three-field value that represents the locale; see "Locale Argument"
Examples
This example returns the string 07 JAN 2013 10:09 PM AD:
ora-fn:dateTime-to-string-with-format("dd MMM yyyy hh:mm a G", xs:dateTime("2013-01-07T22:09:44"))
The next example returns the string "01-07-2013":
ora-fn:dateTime-to-string-with-format("MM-dd-yyyy", xs:dateTime("2013-01-07T22:09:44"))
Returns a new time value from an input string according to the specified pattern.
Signature
ora-fn:time-from-string-with-format($format as xs:string?, $timeString as xs:string?, $locale as xs:string?) as xs:time?

ora-fn:time-from-string-with-format($format as xs:string?, $timeString as xs:string?) as xs:time?
Parameters
$format
: The pattern; see "Format Argument"
$timeString
: The time
$locale
: A one- to three-field value that represents the locale; see Locale Argument
Example
This example returns 9:45:22 PM in the current time zone:
ora-fn:time-from-string-with-format("HH.mm.ss", "21.45.22")
The next example returns 8:07:22 PM in the current time zone:
ora-fn:time-from-string-with-format("hh:mm:ss a", "8:07:22 PM")
Returns a time string with the specified pattern.
Signature
ora-fn:time-to-string-with-format($format as xs:string?, $time as xs:time?, $locale as xs:string?) as xs:string?

ora-fn:time-to-string-with-format($format as xs:string?, $time as xs:time?) as xs:string?
Parameters
$format
: The pattern; see "Format Argument"
$time
: The time
$locale
: A one- to three-field value that represents the locale; see "Locale Argument"
Examples
This example returns the string "10:09 PM":
ora-fn:time-to-string-with-format("hh:mm a", xs:time("22:09:44"))
The next example returns the string "22:09 PM":
ora-fn:time-to-string-with-format("HH:mm a", xs:time("22:09:44"))
The $format argument identifies the various fields that compose a date or time value.
The $locale argument represents a specific geographic, political, or cultural region.
It is defined by up to three fields:
Language code: The ISO 639 alpha-2 or alpha-3 language code, or the registered language subtags of up to eight letters. For example, en for English and ja for Japanese.
Country code: The ISO 3166 alpha-2 country code or the UN M.49 numeric-3 area code. For example, US for the United States and 029 for the Caribbean.
Variant: Indicates a variation of the locale, such as a particular dialect. Order multiple values in order of importance and separate them with an underscore (_). These values are case sensitive.
See Also:
Class Locale in Java Standard Edition 7 Reference
You can manipulate strings in XQuery using Oracle XQuery functions.
The Oracle XQuery functions are in namespace http://xmlns.oracle.com/xdk/xquery/function. The namespace prefix ora-fn is predeclared, and the module is automatically imported.
Adds padding characters to the left of a string to create a fixed-length string. If the input string exceeds the specified size, then it is truncated to return a substring of the specified length. The default padding character is a space (ASCII 32).
Signature
ora-fn:pad-left($str as xs:string?, $size as xs:integer?, $pad as xs:string?) as xs:string?

ora-fn:pad-left($str as xs:string?, $size as xs:integer?) as xs:string?
Parameters
$str
: The input string
$size
: The desired fixed length, which is obtained by adding padding characters to $str
$pad
: The padding character
If either argument is an empty sequence, then the function returns an empty sequence.
Examples
This example prefixes "01" to the input string up to the maximum of six characters. The returned string is "010abc". The function returns one complete and one partial pad character.
ora-fn:pad-left("abc", 6, "01")
This example returns only "ab" because the input string exceeds the specified fixed length:
ora-fn:pad-left("abcd", 2, "01")
This example prefixes spaces to the string up to the specified maximum of six characters. The returned string has a prefix of two spaces: "  abcd":
ora-fn:pad-left("abcd", 6)
The next example returns only "ab" because the input string exceeds the specified fixed length:
ora-fn:pad-left("abcd", 2)
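The examples above can be reproduced with a short Python sketch of the pad-left semantics (pad_left is a hypothetical name, shown only to make the truncation and partial-padding behavior concrete):

```python
def pad_left(s, size, pad=" "):
    """Left-pad `s` to exactly `size` characters, truncating if `s` is
    longer. The padding string is repeated and cut, so a partial padding
    sequence can appear (for example "010" from the two-character pad "01")."""
    if s is None or size is None:
        return None          # empty sequence in, empty sequence out
    if len(s) >= size:
        return s[:size]
    needed = size - len(s)
    return (pad * needed)[:needed] + s
```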
Adds padding characters to the right of a string to create a fixed-length string. If the input string exceeds the specified size, then it is truncated to return a substring of the specified length. The default padding character is a space (ASCII 32).
Signature
ora-fn:pad-right($str as xs:string?, $size as xs:integer?, $pad as xs:string?) as xs:string?

ora-fn:pad-right($str as xs:string?, $size as xs:integer?) as xs:string?
Parameters
$str
: The input string
$size
: The desired fixed length, which is obtained by adding padding characters to $str
$pad
: The padding character
If either argument is an empty sequence, then the function returns an empty sequence.
Examples
This example appends "01" to the input string up to the maximum of six characters. The returned string is "abc010". The function returns one complete and one partial pad character.
ora-fn:pad-right("abc", 6, "01")
This example returns only "ab" because the input string exceeds the specified fixed length:
ora-fn:pad-right("abcd", 2, "01")
This example appends spaces to the string up to the specified maximum of six characters. The returned string has a suffix of two spaces: "abcd  ":
ora-fn:pad-right("abcd", 6)
The next example returns only "ab" because the input string exceeds the specified fixed length:
ora-fn:pad-right("abcd", 2)
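ora-fn:pad-right mirrors pad-left, appending the padding instead of prefixing it; a corresponding sketch (pad_right is again a hypothetical name used only for illustration):

```python
def pad_right(s, size, pad=" "):
    """Right-pad `s` to exactly `size` characters, truncating if `s` is longer."""
    if s is None or size is None:
        return None          # empty sequence in, empty sequence out
    if len(s) >= size:
        return s[:size]
    needed = size - len(s)
    return s + (pad * needed)[:needed]
```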
Removes any leading or trailing white space from a string.
Signature
ora-fn:trim($input as xs:string?) as xs:string?
Parameters
$input
: The string to trim. If $input
is an empty sequence, then the function returns an empty sequence. Other data types trigger an error.
Example
This example returns the string "abc":
ora-fn:trim(" abc ")
Removes any leading white space.
Signature
ora-fn:trim-left($input as xs:string?) as xs:string?
Parameters
$input
: The string to trim. If $input
is an empty sequence, then the function returns an empty sequence. Other data types trigger an error.
Example
This example removes the leading spaces and returns the string "abc ":
ora-fn:trim-left(" abc ")
Removes any trailing white space.
Signature
ora-fn:trim-right($input as xs:string?) as xs:string?
Parameters
$input
: The string to trim. If $input
is an empty sequence, then the function returns an empty sequence. Other data types trigger an error.
Example
This example removes the trailing spaces and returns the string " abc":

ora-fn:trim-right(" abc ")
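The behavior of the three trim functions corresponds to Python's strip family (trim, trim_left, and trim_right below are hypothetical names used only for illustration):

```python
def trim(s):
    """ora-fn:trim: remove leading and trailing white space."""
    return None if s is None else s.strip()

def trim_left(s):
    """ora-fn:trim-left: remove leading white space only."""
    return None if s is None else s.lstrip()

def trim_right(s):
    """ora-fn:trim-right: remove trailing white space only."""
    return None if s is None else s.rstrip()
```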
These functions are in the http://xmlns.oracle.com/hadoop/xquery
namespace. The oxh
prefix is predeclared and the module is automatically imported.
The Hadoop module is described in the following topic:
Hadoop Functions
The following functions are built in to Oracle XQuery for Hadoop:
Returns a sequence of file paths that match a pattern.
Signature
oxh:find($pattern as xs:string?) as xs:string*
Parameters
$pattern
: The file pattern to search for
See Also:
For the file pattern, see the globStatus method in the Apache Hadoop API.
Increments a user-defined MapReduce job counter. The default increment is one (1).
Signature
oxh:increment-counter($groupName as xs:string, $counterName as xs:string, $value as xs:integer)

oxh:increment-counter($groupName as xs:string, $counterName as xs:string)
Parameters
$groupName
: The group of counters that this counter belongs to.
$counterName
: The name of a user-defined counter
$value
: The amount to increment the counter
Prints a line of text to stdout of the Oracle XQuery for Hadoop client process. Use this function when developing queries.
Signature
declare %updating function oxh:println($arg as xs:anyAtomicType?)
Parameters
$arg
: A value to add to the output. A cast operation first converts it to string. An empty sequence is handled the same way as an empty string.
Example
This example prints the values of data.txt to stdout:
for $i in text:collection("data.txt")
return oxh:println($i)
Prints a line of text or XML to stdout of the Oracle XQuery for Hadoop client process. Use this function when developing queries and printing nodes of an XML document.
Signature
declare %updating function oxh:println-xml($arg as item()?)
Parameters
$arg
: A value to add to the output. The input item is converted to text as defined by the XSLT 2.0 and XQuery 1.0 serialization specifications. An empty sequence is handled the same way as an empty string.
Several adapters have serialization annotations (%output:*). The following lists identify the serialization parameters that Oracle XQuery for Hadoop supports.
Serialization parameters supported for the text output method:
encoding
: Any encoding supported by the JVM
normalization-form
: none, NFC, NFD, NFKC, NFKD
Serialization parameters supported for the xml output method, using any values permitted by the XQuery specification:
cdata-section-elements
doctype-public
doctype-system
encoding
indent
normalization-form
omit-xml-declaration
standalone
See Also:
"The Influence of Serialization Parameters" sections for XML and text output methods in XSLT and XQuery Serialization, at locations like the following:
http://www.w3.org/TR/xslt-xquery-serialization/#XML_DOCTYPE
http://www.w3.org/TR/xslt-xquery-serialization/#XML_CDATA-SECTION-ELEMENTS