Avro File Adapter

The Avro file adapter provides functions to read and write Avro container files in HDFS. It is described in the following topics:

Built-in Functions for Reading Avro Files

To use the built-in functions in your query, you must import the Avro file module as follows:

import module "oxh:avro";

The Avro file module contains the following functions:

There are no built-in functions for writing Avro container files. To write Avro files, you must use a custom function that specifies the Avro writer schema.

avro:collection-avroxml

Accesses a collection of Avro files in HDFS. The files might be split up and processed in parallel by multiple tasks. The function returns an XML element for each object. See "About Converting Values Between Avro and XML."

Signature

declare %avro:collection("avroxml") function 
   avro:collection-avroxml($uris as xs:string*) as element()* external;

Parameters

$uris: The Avro file URIs

Returns

One XML element for each Avro object.

avro:get

Retrieves an entry from an Avro map modeled as XML.

If you omit the $map parameter, then the behavior is identical to calling the two-argument function and using the context item for $map.

Signature

avro:get($key as xs:string?, $map as node()?) as element(oxh:entry)?
 
avro:get($key as xs:string?) as element(oxh:entry)?

Returns

The value of this XPath expression:

$map/oxh:entry[@key eq $key]

Example

These function calls are equivalent:

$var/avro:get("key")

avro:get("key", $var)
 
$var/oxh:entry[@key eq "key"]

In this example, $var is an Avro map modeled as XML. See "Reading Maps."

Custom Functions for Reading Avro Container Files

You can use the following annotations to define functions that read collections of Avro container files in HDFS. These annotations provide additional functionality that is not available using the built-in functions.

Signature

Custom functions for reading Avro files must have the following signature:

declare %avro:collection("avroxml") [additional annotations]
   function local:myFunctionName($uris as xs:string*) as element()* external;

Annotations

%avro:collection("avroxml")

Declares the avroxml collection function. Required.

A collection function accesses Avro files in HDFS. The files might be split up and processed in parallel by multiple tasks. The function returns an XML element for each object. See "About Converting Values Between Avro and XML."

%avro:schema("avro-schema")

Provides the Avro reader schema as the value of the annotation. Optional.

The objects in the file are mapped to the reader schema when it is specified. For example:

%avro:schema('
   {
      "type": "record",
      "name": "Person",
      "fields" : [
         {"name": "full_name", "type": "string"},
         {"name": "age", "type": ["int", "null"] }
       ]
   }
')

You cannot combine this annotation with %avro:schema-file or %avro:schema-kv.

See Also:

"Schema Resolution" in the Apache Avro Specification at

http://avro.apache.org/docs/current/spec.html#Schema+Resolution
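As a sketch, the annotations above combine into a complete custom function declaration. The local:read-people name and the file pattern here are illustrative, not part of the adapter:

```xquery
declare 
   %avro:collection("avroxml")
   %avro:schema('
      {
         "type": "record",
         "name": "Person",
         "fields" : [
            {"name": "full_name", "type": "string"}
         ]
      }
   ')
function local:read-people($uris as xs:string*) as element()* external;

for $p in local:read-people("mydata/person*.avro")
return $p/full_name
```

Objects in the files are mapped to the Person reader schema before they are returned as XML elements.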

%avro:schema-file("avro-schema-uri")

Like %avro:schema, but the annotation value is a file URI that contains the Avro reader schema. Relative URIs are resolved against the current working directory of the client's local file system. Optional.

For example, %avro:schema-file("schemas/person.avsc").

You cannot combine this annotation with %avro:schema or %avro:schema-kv.

%avro:schema-kv("schema-name")

Like %avro:schema, but the annotation value is a fully qualified record name. The record schema is retrieved from the Oracle NoSQL Database catalog. Optional.

For example, %avro:schema-kv("org.example.PersonRecord").

You must specify the connection parameters to Oracle NoSQL Database when you use this annotation. See "Oracle NoSQL Database Adapter Configuration Properties."

You cannot combine this annotation with %avro:schema or %avro:schema-file.

%avro:split-max("split-size")

Specifies the maximum split size as either an integer or a string value. The split size controls how the input file is divided into tasks. Hadoop calculates the split size as max($split-min, min($split-max, $block-size)). Optional.

In a string value, you can append K, k, M, m, G, or g to the value to indicate kilobytes, megabytes, or gigabytes instead of bytes (the default unit). These qualifiers are not case sensitive. The following examples are equivalent:

%avro:split-max(1024)
%avro:split-max("1024")
%avro:split-max("1K")

%avro:split-min("split-size")

Specifies the minimum split size as either an integer or a string value. The split size controls how the input file is divided into tasks. Hadoop calculates the split size as max($split-min, min($split-max, $block-size)). Optional.

In a string value, you can append K, k, M, m, G, or g to the value to indicate kilobytes, megabytes, or gigabytes instead of bytes (the default unit). These qualifiers are not case sensitive. The following examples are equivalent:

%avro:split-min(1024)
%avro:split-min("1024")
%avro:split-min("1K")
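For example, a custom collection function might cap splits at 128 MB. This is a sketch; the local:read function name is illustrative:

```xquery
declare 
   %avro:collection("avroxml")
   %avro:split-max("128M")
function local:read($uris as xs:string*) as element()* external;
```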

Custom Functions for Writing Avro Files

You can use the following annotations to define functions that write Avro files.

Signature

Custom functions for writing Avro files must have the following signature:

declare %avro:put("avroxml") [additional annotations]
   function local:myFunctionName($value as item()) external;

Annotations

%avro:put("avroxml")

Declares the avroxml put function. Required.

An Avro schema must be specified using one of the following annotations:

  • %avro:schema

  • %avro:schema-file

  • %avro:schema-kv

The input XML value is converted to an instance of the schema. See "Writing XML as Avro."

%avro:schema("avro-schema")

Specifies the schema of the files. For example:

%avro:schema('
   {
      "type": "record",
      "name": "Person",
      "fields" : [
         {"name": "full_name", "type": "string"},
         {"name": "age", "type": ["int", "null"] }
      ]
   }
')

You cannot combine this annotation with %avro:schema-file or %avro:schema-kv.

%avro:schema-file("avro-schema-uri")

Like %avro:schema, but the annotation value is a file URI that contains the Avro schema used for writing. Relative URIs are resolved against the current working directory of the client's local file system.

For example: %avro:schema-file("schemas/person.avsc")

You cannot combine this annotation with %avro:schema or %avro:schema-kv.

%avro:schema-kv("schema-name")

Like %avro:schema, but the annotation value is a fully qualified record name. The record schema is retrieved from the Oracle NoSQL Database catalog.

For example: %avro:schema-kv("org.example.PersonRecord")

You must specify the connection parameters to Oracle NoSQL Database when you use this annotation. See "Oracle NoSQL Database Adapter Configuration Properties."

You cannot combine this annotation with %avro:schema or %avro:schema-file.

%avro:compress("method", [level]?)

Specifies the compression format used on the output.

The method is one of the following string literal values:

  • deflate: The level controls the trade-off between speed and compression. Valid values are 1 to 9, where 1 is the fastest and 9 is the most compressed.

  • snappy: This algorithm is designed for high speed and moderate compression.

The default is no compression.

The level is an integer value. It is optional and only supported when the method is deflate.

For example:

%avro:compress("snappy")
%avro:compress("deflate")
%avro:compress("deflate", 3)

%avro:file("name")

Specifies the output file name prefix. The default prefix is part.
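These output annotations can appear together in one declaration. The following sketch (the function name, schema file, and prefix are illustrative) writes deflate-compressed files whose names start with person:

```xquery
declare 
   %avro:put("avroxml")
   %avro:compress("deflate", 6)
   %avro:file("person")
   %avro:schema-file("schemas/person.avsc")
function local:put($value as item()) external;
```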

Examples of Avro File Adapter Functions

These examples use the following text file in HDFS:

mydata/ages.txt

john,45
kelly,36
laura,
mike,27

Example 1   Converting a Text File to Avro

The following query converts the file into compressed Avro container files:

import module "oxh:text";
 
declare 
   %avro:put("avroxml")
   %avro:compress("snappy")
   %avro:schema('
      {
         "type": "record",
         "name": "AgeRec",
         "fields" : [
            {"name": "user", "type": "string"},
            {"name": "age", "type": ["int", "null"] }
         ]
      }
   ')
function local:put($arg as item()) external;

for $line in text:collection("mydata/ages.txt")
let $split := fn:tokenize($line, ",")
return
   local:put(
      <rec>
         <user>{$split[1]}</user>
         {
            if ($split[2] castable as xs:int) then
               <age>{$split[2]}</age>
            else 
               ()
         }
      </rec>
   )

The query generates an Avro file with the following records, represented here as JSON:

{"user":"john","age":{"int":45}}
{"user":"kelly","age":{"int":36}}
{"user":"laura","age":null}
{"user":"mike","age":{"int":27}}

Example 2   Querying Records in Avro Container Files

The next query selects records in which the age is either null or greater than 30, from the myoutput directory. The query in Example 1 generated the records.

import module "oxh:text";
import module "oxh:avro";

for $rec in avro:collection-avroxml("myoutput/part*.avro")
where $rec/age/nilled() or $rec/age gt 30
return
   text:put($rec/user)

This query creates files that contain the following lines:

john
kelly
laura

About Converting Values Between Avro and XML

This section describes how Oracle XQuery for Hadoop converts data between Avro and XML:

Reading Avro as XML

Both the Avro file adapter and the Oracle NoSQL Database adapter have an avroxml method, which you can use with the collection functions to read Avro records as XML. After the Avro is converted to XML, you can query and transform the data using XQuery.

The following topics describe how Oracle XQuery for Hadoop reads Avro:

Reading Records

An Avro record is converted to an <oxh:item> element with one child element for each field in the record.

For example, consider the following Avro schema:

{
   "type": "record",
   "name": "Person",
   "fields" : [
      {"name": "full_name", "type": "string"},
      {"name": "age", "type": ["int", "null"] }
   ]
}

This is an instance of the record modeled as XML:

<oxh:item>
   <full_name>John Doe</full_name>
   <age>46</age>
</oxh:item>

Converting Avro records to XML enables XQuery to query them. The next example queries an Avro container file named person.avro, which contains Person records. The query converts the records to a CSV text file in which each line contains the full_name and age values:

import module "oxh:avro";
import module "oxh:text";
 
for $x in avro:collection-avroxml("person.avro")
return
   text:put($x/full_name || "," || $x/age)
 

Null values are converted to nilled elements. A nilled element has an xsi:nil attribute set to true; it is always empty. You can use the XQuery fn:nilled function to test if a record field is null. For example, the following query writes the name of Person records that have a null value for age:

import module "oxh:avro";
import module "oxh:text";
 
for $x in avro:collection-avroxml("person.avro")
where $x/age/nilled()
return
   text:put($x/full_name)
 

For nested records, the fields of the inner schema become child elements of the element that corresponds to the field in the outer schema. For example, this schema has a nested record:

{
   "type": "record", 
   "name": "PersonAddress",
   "fields" : [
      {"name": "full_name", "type": "string"},
      {"name": "address", "type": 
         { "type" : "record",
           "name" : "Address",
           "fields" : [
              { "name" : "street", "type" : "string" },
              { "name" : "city", "type" : "string" }
            ]
         }
      }
   ]
}

This is an instance of the record as XML:

<oxh:item>
   <full_name>John Doe</full_name>
   <address>
      <street>123 First St.</street>
      <city>New York</city>
   </address>
</oxh:item>

The following example queries an Avro container file named person-address.avro that contains PersonAddress records, and writes the names of the people that live in New York to a text file:

import module "oxh:avro";
import module "oxh:text";

for $person in avro:collection-avroxml("examples/person-address.avro")
where $person/address/city eq "New York"
return
   text:put($person/full_name)

Reading Maps

Avro map values are converted to an element that contains one child <oxh:entry> element for each entry in the map. For example, consider the following schema:

{
   "type": "record", 
   "name": "PersonProperties",
   "fields" : [
      {"name": "full_name", "type": "string"},
      {"name": "properties", "type": 
         {"type": "map", "values": "string"}
      }
   ]
}

This is an instance of the schema as XML:

<oxh:item>
   <full_name>John Doe</full_name>
   <properties>
      <oxh:entry key="employer">Example Inc</oxh:entry>
      <oxh:entry key="hair color">brown</oxh:entry>
      <oxh:entry key="favorite author">George RR Martin</oxh:entry>
   </properties>
</oxh:item>

The following example queries a file named person-properties.avro that contains PersonProperties records, and writes the names of the people that are employed by Example Inc. The query shows how regular XPath expressions can retrieve map entries. Alternatively, you can use the avro:get function as a shortcut to retrieve map entries.

import module "oxh:avro";
import module "oxh:text";
 
for $person in avro:collection-avroxml("person-properties.avro")
where $person/properties/oxh:entry[@key eq "employer"] eq "Example Inc"
return
   text:put($person/full_name)
 

The following query uses the avro:get function to retrieve the employer entry. It is equivalent to the previous query.

import module "oxh:avro";
import module "oxh:text";
 
for $person in avro:collection-avroxml("person-properties.avro")
where $person/properties/avro:get("employer") eq "Example Inc"
return
   text:put($person/full_name)

You can use the XQuery fn:nilled function to test for null values. For example, this expression returns true if the map entry is null:

$var/avro:get("key")/nilled()

Reading Arrays

Oracle XQuery for Hadoop converts Avro array values to an element that contains a child <oxh:item> element for each item in the array. For example, consider the following schema:

{
   "type": "record", 
   "name": "PersonScores",
   "fields" : [
      {"name": "full_name", "type": "string"},
      {"name": "scores", "type": 
         {"type": "array", "items": "int"}
      }
   ]
}

This is an instance of the schema as XML:

<oxh:item>
   <full_name>John Doe</full_name>
   <scores>
      <oxh:item>128</oxh:item>
      <oxh:item>151</oxh:item>
      <oxh:item>110</oxh:item>
   </scores>
</oxh:item>

The following example queries a file named person-scores.avro that contains PersonScores records, and writes the sum and count of scores for each person:

import module "oxh:avro";
import module "oxh:text";
 
for $person in avro:collection-avroxml("person-scores.avro")
let $scores := $person/scores/*
return
   text:put($person/full_name || "," || sum($scores) || "," || count($scores))

You can access a specific element of an array by using a numeric XPath predicate. For example, this path expression selects the second score. XPath indexing starts at 1 (not 0).

$person/scores/oxh:item[2]

Reading Unions

Oracle XQuery for Hadoop converts an instance of an Avro union type based on the actual member type of the value. The name of the member type is added as an XML avro:type attribute to the enclosing element, which ensures that queries can distinguish between instances of different member types. However, the attribute is not added for trivial unions where there are only two member types and one of them is null.

For example, consider the following union of two records:

[
   {
      "type": "record", 
      "name": "Person1",
      "fields" : [
         {"name": "full_name", "type": "string"}
      ]
   }
   ,
   {
      "type": "record", 
      "name": "Person2",
      "fields" : [
         {"name": "fname", "type": "string"}
      ]
   }
]
 

This is an instance of the schema as XML:

<oxh:item avro:type="Person2">
   <fname>John Doe</fname>
</oxh:item>

The following example queries a file named person-union.avro that contains instances of the previous union schema, and writes the names of the people from both record types to a text file:

import module "oxh:avro";
import module "oxh:text";

for $person in avro:collection-avroxml("examples/person-union.avro")
return
   if ($person/@avro:type eq "Person1") then
      text:put($person/full_name)
   else if ($person/@avro:type eq "Person2") then
      text:put($person/fname)
   else
      error(xs:QName("UNEXPECTED"), "Unexpected record type:" || $person/@avro:type)

Reading Primitives

Table 6-1 shows how Oracle XQuery for Hadoop maps Avro primitive types to XQuery atomic types.

Table 6-1 Mapping Avro Primitive Types to XQuery Atomic Types

Avro        XQuery
-------     ------------
boolean     xs:boolean
int         xs:int
long        xs:long
float       xs:float
double      xs:double
bytes       xs:hexBinary
string      xs:string


Avro null values are mapped to empty nilled elements. To distinguish between a null string value and an empty string value, use the XQuery nilled function. This path expression only returns true if the field value is null:

$record/field/fn:nilled()

Avro fixed values are mapped to xs:hexBinary, and enums are mapped to xs:string.


Writing XML as Avro

Both the Avro file adapter and the Oracle NoSQL Database adapter have an avroxml method, which you can use with the put functions to write XML as Avro. The following topics describe how the XML is converted to an Avro instance:

Writing Records

Oracle XQuery for Hadoop maps the XML to an Avro record schema by matching the child element names to the field names of the record. For example, consider the following Avro schema:

{
   "type": "record",
   "name": "Person",
   "fields" : [
      {"name": "full_name", "type": "string"},
      {"name": "age", "type": ["int", "null"] }
   ]
}

You can use the following XML element to write an instance of this record in which the full_name field is John Doe and the age field is 46. The name of the root element (person, in this example) is inconsequential. Only the names of the child elements are used to map to the Avro record fields (full_name and age).

<person>
   <full_name>John Doe</full_name>
   <age>46</age>
</person>

The next example uses the following CSV file named people.csv:

John Doe,46
Jane Doe,37
     .
     .
     .
 

This query converts values from the CSV file to Avro Person records:

import module "oxh:avro";
import module "oxh:text";
 
declare 
   %avro:put("avroxml")
   %avro:schema('
      {
         "type": "record", 
         "name": "Person",
         "fields" : [
            {"name": "full_name", "type": "string"},
            {"name": "age", "type": ["int", "null"] }
         ]
      }
   ')
function local:put-person($person as element()) external;

for $line in text:collection("people.csv")
let $split := tokenize($line, ",")
return
   local:put-person(
      <person>
         <full_name>{$split[1]}</full_name>
         <age>{$split[2]}</age>
      </person>
   )
 

For null values, you can omit the element or set the xsi:nil="true" attribute. For example, this modified query sets age to null when the value is not numeric:

   .
   .
   .
for $line in text:collection("people.csv")
let $split := tokenize($line, ",")
return
   local:put-person(
      <person>
         <full_name>{$split[1]}</full_name>
         {
         if ($split[2] castable as xs:int) then
            <age>{$split[2]}</age>
         else 
            ()
         }
      </person>
   )

In the case of nested records, the values are obtained from nested elements. The next example uses the following schema:

{
"type": "record", 
"name": "PersonAddress",
"fields" : [
   {"name": "full_name", "type": "string"},
   {"name": "address", "type": 
      { "type" : "record",
        "name" : "Address",
        "fields" : [
           { "name" : "street", "type" : "string" },
           { "name" : "city", "type" : "string" }
        ]
      }
   }
]
}

You can use the following XML to write an instance of this record:

<person>
   <full_name>John Doe</full_name>
   <address>
      <street>123 First St.</street>
      <city>New York</city>
   </address>
</person>
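As a sketch, a query could construct this nested XML from a CSV file containing name, street, and city columns. The file names and the local:put-person function are illustrative:

```xquery
import module "oxh:text";

declare 
   %avro:put("avroxml")
   %avro:schema-file("schemas/person-address.avsc")
function local:put-person($value as item()) external;

for $line in text:collection("people-address.csv")
let $split := fn:tokenize($line, ",")
return
   local:put-person(
      <person>
         <full_name>{$split[1]}</full_name>
         <address>
            <street>{$split[2]}</street>
            <city>{$split[3]}</city>
         </address>
      </person>
   )
```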

Writing Maps

Oracle XQuery for Hadoop converts XML to an Avro map with one map entry for each <oxh:entry> child element. For example, consider the following schema:

{
   "type": "record", 
   "name": "PersonProperties",
   "fields" : [
      {"name": "full_name", "type": "string"},
      {"name": "properties", "type": 
         {"type": "map", "values": "string"}
      }
   ]
}

You can use the following XML element to write an instance of this schema in which the full_name field is John Doe, and the properties field is set to a map with three entries:

<person>
   <full_name>John Doe</full_name>
   <properties>
      <oxh:entry key="hair color">brown</oxh:entry>
      <oxh:entry key="favorite author">George RR Martin</oxh:entry>
      <oxh:entry key="employer">Example Inc</oxh:entry>
   </properties>
</person>
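Map entries can also be constructed dynamically. This sketch (the $name and $props variables are illustrative) builds one <oxh:entry> element for each key=value pair in a semicolon-delimited string:

```xquery
<person>
   <full_name>{$name}</full_name>
   <properties>
      {
         for $pair in fn:tokenize($props, ";")
         let $kv := fn:tokenize($pair, "=")
         return
            <oxh:entry key="{$kv[1]}">{$kv[2]}</oxh:entry>
      }
   </properties>
</person>
```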

Writing Arrays

Oracle XQuery for Hadoop converts XML to an Avro array with one item for each <oxh:item> child element. For example, consider the following schema:

{
   "type": "record", 
   "name": "PersonScores",
   "fields" : [
      {"name": "full_name", "type": "string"},
      {"name": "scores", "type": 
         {"type": "array", "items": "int"}
      }
   ]
}

You can use the following XML element to write an instance of this schema in which the full_name field is John Doe and the scores field is set to [128, 151, 110]:

<person>
   <full_name>John Doe</full_name>
   <scores>
      <oxh:item>128</oxh:item>
      <oxh:item>151</oxh:item>
      <oxh:item>110</oxh:item>
   </scores>
</person>
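Array items can likewise be generated from a sequence. In this sketch, $scores is assumed to be a sequence of integers:

```xquery
<person>
   <full_name>{$name}</full_name>
   <scores>
      {
         for $s in $scores
         return <oxh:item>{$s}</oxh:item>
      }
   </scores>
</person>
```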

Writing Unions

When writing an Avro union type, Oracle XQuery for Hadoop bases the selection of a member type on the value of the avro:type attribute.

This example uses the following schema:

[
   {
      "type": "record", 
      "name": "Person1",
      "fields" : [
         {"name": "full_name", "type": "string"}
      ]
   }
   ,
   {
      "type": "record", 
      "name": "Person2",
      "fields" : [
         {"name": "fname", "type": "string"}
      ]
   }
]

The following XML is mapped to an instance of the Person1 record:

<person avro:type="Person1">
   <full_name>John Doe</full_name>
</person>

This XML is mapped to an instance of the Person2 record:

<person avro:type="Person2">
   <fname>John Doe</fname>
</person>

The avro:type attribute selects the member type of the union. For trivial unions that contain a null and one other type, the avro:type attribute is unnecessary. If the member type cannot be determined, then an error is raised.


Writing Primitives

To map primitive values, Oracle XQuery for Hadoop uses the equivalent data types shown in Table 6-1 to cast an XML value to the corresponding Avro type. If the value cannot be converted to the Avro type, then an error is raised.

This example uses the following schema:

{
   "type": "record",
   "name": "Person",
   "fields" : [
      {"name": "full_name", "type": "string"},
      {"name": "age", "type": ["int", "null"] }
   ]
}

Attempting to map the following XML to an instance of this schema raises an error, because the string value apple cannot be converted to an int:

<person>
   <full_name>John Doe</full_name>
   <age>apple</age>
</person>