Oracle XML Extensions for Hive

7 Oracle XML Extensions for Hive

This chapter explains how to use the XML extensions for Apache Hive provided with Oracle XQuery for Hadoop. The chapter contains the following sections:

What are the XML Extensions for Hive?

The XML Extensions for Hive provide XML processing support that enables you to do the following:

Query large XML files in HDFS as Hive tables
Query XML strings in Hive tables
Query XML file resources in the Hadoop distributed cache
Efficiently extract atomic values from XML without using expensive DOM parsing
Retrieve, generate, and transform complex XML elements
Generate multiple table rows from a single XML value
Manage missing and dirty data in XML

The XML extensions also support these W3C modern standards:

XQuery 3.1
XQuery Update Facility 1.0 (transform expressions)
XPath 3.1
XML Schema 1.0
XML Namespaces

The XML extensions have two components:

XML InputFormat and SerDe for creating XML tables

See "Create XML Tables."
XML function library

See "About the Hive Functions."

Use the Hive Extensions From the Command Line

To enable the Oracle XQuery for Hadoop extensions, use the --auxpath and -i arguments when starting Hive:

$ hive --auxpath \
$OXH_HOME/hive/lib/oxh-hive.jar,\
$OXH_HOME/hive/lib/oxh-mapreduce.jar,\
$OXH_HOME/hive/lib/oxquery.jar,\
$OXH_HOME/hive/lib/xqjapi.jar,\
$OXH_HOME/hive/lib/apache-xmlbeans.jar,\
$OXH_HOME/hive/lib/woodstox-core-asl-*.jar,\
$OXH_HOME/hive/lib/stax2-api-*.jar \
-i $OXH_HOME/hive/init.sql

Note:

On the Oracle BigDataLite VM, HIVE_AUX_JARS_PATH contains the Hive extensions by default and hence specifying --auxpath is unnecessary.

The first time you use the extensions, verify that they are accessible. The following procedure creates a table named SRC, loads one row into it, and calls the xml_query function.

To verify that the extensions are accessible:

Start the Hive command-line interface (CLI):

$ hive --auxpath \
$OXH_HOME/hive/lib/oxh-hive.jar,\
$OXH_HOME/hive/lib/oxh-mapreduce.jar,\
$OXH_HOME/hive/lib/oxquery.jar,\
$OXH_HOME/hive/lib/xqjapi.jar,\
$OXH_HOME/hive/lib/apache-xmlbeans.jar,\
$OXH_HOME/hive/lib/woodstox-core-asl-*.jar,\
$OXH_HOME/hive/lib/stax2-api-*.jar \
-i $OXH_HOME/hive/init.sql

The init.sql file contains the CREATE TEMPORARY FUNCTION statements that declare the XML functions.

Call an Oracle XQuery for Hadoop function for Hive. This example calls the xml_query function to parse an XML string:
```
hive> SELECT xml_query("x/y", "<x><y>123</y><z>456</z></x>");
     .
     .
     .
["123"]
```

If the extensions are accessible, then the query returns ["123"], as shown in the example.

Use the Hive Extensions in HiveServer2

To enable the XML Extensions for Hive in HiveServer2, you must add the following JARs to the environment variable HIVE_AUX_JARS_PATH and to the configuration property hive.aux.jars.path.

$OXH_HOME/hive/lib/woodstox-core-asl-*.jar            
$OXH_HOME/hive/lib/apache-xmlbeans.jar            
$OXH_HOME/hive/lib/oxh-hive.jar            
$OXH_HOME/hive/lib/oxh-mapreduce.jar           
$OXH_HOME/hive/lib/stax2-api-*.jar            
$OXH_HOME/hive/lib/xqjapi.jar            
$OXH_HOME/hive/lib/oxquery.jar

If you are using Cloudera's distribution including Apache Hadoop (on Oracle Big Data Appliance or commodity hardware) you can add the JARs as follows:

In Cloudera Manager, click hive:
Click Configuration to go to the Configuration tab.
In the search field, enter hive_aux. Then, add the following text after whatever value is in Gateway Client Environment Advanced Configuration Snippet (Safety Valve) for hive-env.sh . Be sure to include the leading comma. It is advisable to save the text of the original path before editing this field in case you need to revert at a later time.
```
,/opt/oracle/oxh-*/hive/lib/woodstox-core-asl-oxh-*.jar,/opt/oracle/oxh-*/hive/lib/apache-xmlbeans.jar,/opt/oracle/oxh-*/hive/lib/oxh-hive.jar,/opt/oracle/oxh-*/hive/lib/oxh-mapreduce.jar,/opt/oracle/oxh-*/hive/lib/stax2-api-*.jar,/opt/oracle/oxh-*/hive/lib/xqjapi.jar,/opt/oracle/oxh-*/hive/lib/oxquery.jar
```
Note that the JAR version numbers in the paths above are wildcarded (*). Replace the wildcards with the correct JAR versions for the OXH release you are working with.
Search for hive.aux.jars.path. Add the following text after whatever value is in Hive Client Advanced Configuration Snippet (Safety Valve) for hive-site.xml . Be sure to add the text to the value field in the Gateway Default Group panel:
Click Save Changes.
Click Actions and then click Deploy Client Configuration.
After the service restarts, check the status to ensure that it is working correctly.
To verify that the extensions are accessible, use Beeline:
1. Start Beeline:
```
$ beeline -i $OXH_HOME/hive/init.sql
```
  Note:
  You can use init.sql as shown above to temporarily declare the functions xml_query, xml_table, et cetera. You can also enable the functions permanently as described in Permanently Declaring the Hive Functions.
2. At the Beeline CLI, enter the following. (Fill in the placeholders host andport with the host and port of hiveserver2.)
```
beeline> !connect jdbc:hive2: //[host]:[port] 
```
  When prompted for a username and password, you can press Enter.
3. All the XML extension functions should be created when you connected to Hive. Try the following command to test if they are working correctly:
```
beeline> SELECT xml_query("x/y", "<x><y>123</y><z>456</z></x>");
```
  The results returned should be similar to this example:
```
+----------+--+
|   _c0    |
+----------+--+
| ["123"]  |
+----------+--+
```
If this test is successful, then you are ready to use the Oracle XQuery for Hadoop functions for Hive in HiveServer2.

See Also:

Permanently Declaring the Hive Functions.

You can use init.sql to temporarily declare the functions, as shown in previous examples. You can also permanently declare them.

About the Hive Functions

The Oracle XQuery for Hadoop extensions enable you to query XML strings in Hive tables and XML file resources in the Hadoop distributed cache. These are the functions:

xml_query: Returns the result of a query as an array of STRING values.
xml_query_as_primitive: Returns the result of a query as a Hive primitive value. Each Hive primitive data type has a separate function named for it.
xml_exists: Tests if the result of a query is empty
xml_table: Maps an XML value to zero or more table rows, and enables nested repeating elements in XML to be mapped to Hive table rows.

See "Oracle XML Functions for Hive Reference."

Permanently Declaring the Hive Functions

In the examples in the previous section, $OXH_HOME/hive/init.sql is used to temporarily declare the XML extensions for Hive functions. However, as an alternative, you can permanently declare the functions so that init.sql is not needed. Use the following commands to permanently declare the functions.

CREATE FUNCTION xml_query AS 'oracle.hadoop.xquery.hive.OXMLQueryUDF' ;
CREATE FUNCTION xml_query_as_bigint AS 'oracle.hadoop.xquery.hive.OXMLQueryBigintUDF' ;
CREATE FUNCTION xml_query_as_int AS 'oracle.hadoop.xquery.hive.OXMLQueryIntUDF' ;
CREATE FUNCTION xml_query_as_smallint AS 'oracle.hadoop.xquery.hive.OXMLQuerySmallintUDF' ;
CREATE FUNCTION xml_query_as_tinyint AS 'oracle.hadoop.xquery.hive.OXMLQueryTinyintUDF' ;
CREATE FUNCTION xml_query_as_float AS 'oracle.hadoop.xquery.hive.OXMLQueryFloatUDF' ;
CREATE FUNCTION xml_query_as_double AS 'oracle.hadoop.xquery.hive.OXMLQueryDoubleUDF' ';
CREATE FUNCTION xml_query_as_boolean AS 'oracle.hadoop.xquery.hive.OXMLQueryBooleanUDF' ;
CREATE FUNCTION xml_query_as_string AS 'oracle.hadoop.xquery.hive.OXMLQueryStringUDF' ;
CREATE FUNCTION xml_exists AS 'oracle.hadoop.xquery.hive.OXMLExists' ;
CREATE FUNCTION xml_table AS 'oracle.hadoop.xquery.hive.OXMLTableUDTF' ;

Create XML Tables

This section describes how you can use the Hive CREATE TABLE statement to create tables over large XML documents.

Hive queries over XML tables scale well, because Oracle XQuery for Hadoop splits up the XML so that the MapReduce framework can process it in parallel.

To support scalable processing and operate in the MapReduce framework, the table adapter scans for elements to use to create table rows. It parses only the elements that it identifies as being part of the table; the rest of the XML is ignored. Thus, the XML table adapter does not perform a true parse of the entire XML document, which imposes limitations on the input XML. Because of these limitations, you can create tables only over XML documents that meet the constraints listed in "XQuery Transformation Requirements." Otherwise, you might get errors or incorrect results.

Hive CREATE TABLE Syntax for XML Tables

The following is the basic syntax of the Hive CREATE TABLE statement for creating a Hive table over XML files:

CREATE TABLE table_name (columns)
ROW FORMAT
   SERDE 'oracle.hadoop.xquery.hive.OXMLSerDe'
STORED AS
   INPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLInputFormat'
   OUTPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLOutputFormat'
TBLPROPERTIES(configuration)

Parameters

Parameter	Description
columns	All column types in an XML table must be one of the Hive primitive types given in "Data Type Conversions."
configuration	Any of the properties described in "CREATE TABLE Configuration Properties." Separate multiple properties with commas.

Note:

Inserting data into XML tables is not supported.

CREATE TABLE Configuration Properties

Use these configuration properties in the configuration parameter of the CREATE TABLE command.

oxh-default-namespace

Sets the default namespace for expressions in the table definition and for XML parsing. The value is a URI.

This example defines the default namespace:

"oxh-default-namespace" = "http://example.com/foo"

oxh-charset

Specifies the character encoding of the XML files. The supported encodings are UTF-8 (default), ISO-8859-1, and US-ASCII.

All XML files for the table must share the same character encoding. Any encoding declarations in the XML files are ignored.

This example defines the character set:

"oxh-charset" = "ISO-8859-1"

oxh-column.name

Specifies how an element selected by the oxh-elements property is mapped to columns in a row. In this property name, replace name with the name of a column in the table. The value can be any XQuery expression. The initial context item of the expression (the "." variable) is bound to the selected element.

Check the log files even when a query executes successfully. If a column expression returns no value or raises a dynamic error, the column value is NULL. The first time an error occurs, it is logged and query processing continues. Subsequent errors raised by the same column expression are not logged.

Any column of the table that does not have a corresponding oxh-column property behaves as if the following property is specified:

"oxh-column.name" = "(./name | ./@name)[1]"

Thus, the default behavior is to select the first child element or attribute that matches the table column name. See "Syntax Example."

oxh-elements

Identifies the names of elements in the XML that map to rows in the table, in a comma-delimited list. This property must be specified one time. Required.

This example maps each element named foo in the XML to a single row in the Hive table:

"oxh-elements" = "foo"

The next example maps each element named either foo or bar in the XML to a row in the Hive table:

"oxh-elements" = "foo, bar"

oxh-entity.name

Defines a set of entity reference definitions.

In the following example, entity references in the XML are expanded from &foo; to "foo value" and from &bar; to "bar value".

"oxh-entity.foo" = "foo value"
"oxh-entity.bar" = "bar value"

oxh-namespace.prefix

Defines a namespace binding.

This example binds the prefix myns to the namespace http://example.org:

"oxh-namespace.myns" = "http://example.org"

You can use this property multiple times to define additional namespaces. The namespace definitions are used when parsing the XML. The oxh-element and oxh-column property values can also reference them.

In the following example, only foo elements in the http://example.org namespace are mapped to table rows:

"oxh-namespace.myns" = "http://example.org",
"oxh-elements" = "myns:foo",
"oxh-column.bar" = "./myns:bar"

CREATE TABLE Examples

This section includes the following examples:

Syntax Example

This example shows how to map XML elements to column names.

Example 7-1 Basic Column Mappings

In the following table definition, the oxh-elements property specifies that each element named foo in the XML is mapped to a single row in the table. The oxh-column properties specify that a Hive table column named BAR gets the value of the child element named bar converted to STRING, and the column named ZIP gets the value of the child element named zip converted to INT.

CREATE TABLE example (bar STRING, zip INT)
ROW FORMAT
   SERDE 'oracle.hadoop.xquery.hive.OXMLSerDe'
STORED AS
   INPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLInputFormat'
   OUTPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLOutputFormat'
TBLPROPERTIES(
   "oxh-elements" = "foo", 
   "oxh-column.bar" = "./bar",
   "oxh-column.zip" = "./zip"
)

Example 7-2 Conditional Column Mappings

In this modified definition of the ZIP column, the column receives a value of -1 if the foo element does not have a child zip element, or if the zip element contains a nonnumeric value:

"oxh-column.zip" = "
   if (./zip castable as xs:int) then 
      xs:int(./zip) 
   else 
      -1
"

Example 7-3 Default Column Mappings

The following two table definitions are equivalent. Table Definition 2 relies on the default mappings for the BAR and ZIP columns.

Table Definition 1

CREATE TABLE example (bar STRING, zip INT)
ROW FORMAT
   SERDE 'oracle.hadoop.xquery.hive.OXMLSerDe'
STORED AS
   INPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLInputFormat'
   OUTPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLOutputFormat'
TBLPROPERTIES(
   "oxh-elements" = "foo", 
   "oxh-column.bar" = "(./bar | ./@bar)[1]",
   "oxh-column.zip" = "(./zip | ./@zip)[1]"
)

Table Definition 2

CREATE TABLE example (bar STRING, zip INT)
ROW FORMAT
   SERDE 'oracle.hadoop.xquery.hive.OXMLSerDe'
STORED AS
   INPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLInputFormat'
   OUTPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLOutputFormat'
TBLPROPERTIES(
   "oxh-elements" = "foo"
)

Simple Examples

These examples show how to create Hive tables over a small XML document that contains comments posted by users of a fictitious website. Each comment element in the document (comments.xml) has one or more like elements that indicate that the user liked the comment.

<comments>
   <comment id="12345" user="john" text="It is raining :( "/>
   <comment id="56789" user="kelly" text="I won the lottery!">
      <like user="john"/>
      <like user="mike"/>
   </comment>
   <comment id="54321" user="mike" text="Happy New Year!">
      <like user="laura"/>
   </comment>
</comments>

In the CREATE TABLE examples, the comments.xml input file is in the current working directory of the local file system.

Example 7-4 Creating a Table

The following Hive CREATE TABLE command creates a table named COMMENTS with a row for each comment containing the user names, text, and number of likes:

hive>
CREATE TABLE comments (usr STRING, content STRING, likeCt INT)
ROW FORMAT
   SERDE 'oracle.hadoop.xquery.hive.OXMLSerDe'
STORED AS
   INPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLInputFormat'
   OUTPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLOutputFormat'
TBLPROPERTIES(
   "oxh-elements" = "comment",
   "oxh-column.usr" = "./@user",
   "oxh-column.content" = "./@text",
   "oxh-column.likeCt" = "fn:count(./like)"
);

The Hive LOAD DATA command loads comments.xml into the COMMENTS table.

hive> LOAD DATA LOCAL INPATH 'comments.xml' OVERWRITE INTO TABLE comments;
]

The following query shows the content of the COMMENTS table.

hive> SELECT usr, content, likeCt FROM comments;
     .
     .
     .
john  It is raining :(     0
kelly I won the lottery!   2
mike  Happy New Year!      1

Example 7-5 Querying an XML Column

This CREATE TABLE command is like the previous example, except that the like elements are produced as XML in a STRING column.

hive>
CREATE TABLE comments2 (usr STRING, content STRING, likes STRING)
ROW FORMAT
   SERDE 'oracle.hadoop.xquery.hive.OXMLSerDe'
STORED AS
   INPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLInputFormat'
   OUTPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLOutputFormat'
TBLPROPERTIES(
   "oxh-elements" = "comment",
   "oxh-column.usr" = "./@user",
   "oxh-column.content" = "./@text",
   "oxh-column.likes" = "fn:serialize(<likes>{./like}</likes>)"
);

The Hive LOAD DATA command loads comments.xml into the table. See "Simple Examples" for the contents of the file.

hive> LOAD DATA LOCAL INPATH 'comments.xml' OVERWRITE INTO TABLE comments2;

The following query shows the content of the COMMENTS2 table.

hive> SELECT usr, content, likes FROM comments2;
     .
     .
     .
john   It is raining :(    <likes/>
kelly  I won the lottery!  <likes><like user="john"/><like user="mike"/></likes>
mike   Happy New Year!     <likes><like user="laura"/></likes>

The next query extracts the user names from the like elements:

hive> SELECT usr, t.user FROM comments2 LATERAL VIEW 
    > xml_table("likes/like", comments2.likes, struct("./@user")) t AS user;
     .
     .
     .
kelly  john
kelly  mike
mike   laura

Example 7-6 Generating XML in a Single String Column

This command creates a table named COMMENTS3 with a row for each comment, and produces the XML in a single STRING column.

hive> 
CREATE TABLE comments3 (xml STRING)
ROW FORMAT
   SERDE 'oracle.hadoop.xquery.hive.OXMLSerDe'
STORED AS
   INPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLInputFormat'
   OUTPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLOutputFormat'
TBLPROPERTIES(
   "oxh-elements" = "comment",
   "oxh-column.xml" = "fn:serialize(.)"
   );

The Hive LOAD DATA command loads comments.xml into the table. See "Simple Examples" for the contents of the file.

hive> LOAD DATA LOCAL INPATH 'comments.xml' OVERWRITE INTO TABLE comments3;

The following query shows the contents of the XML column:

hive> SELECT xml FROM comments3;
     .
     .
     .
<comment id="12345" user="john" text="It is raining :( "/>
<comment id="56789" user="kelly" text="I won the lottery!">
   <like user="john"/>
   <like user="mike"/>
</comment>
<comment id="54321" user="mike" text="Happy New Year!">
   <like user="laura"/>
</comment>

The next query extracts the IDs and converts them to integers:

hive> SELECT xml_query_as_int("comment/@id", xml) FROM comments3;
     .
     .
     .
12345
56789
54321

OpenStreetMap Examples

These examples use data from OpenStreetMap, which provides free map data for the entire world. You can export the data as XML for specific geographic regions or the entire planet. An OpenStreetMap XML document mainly contains a sequence of node, way, and relation elements.

In these examples, the OpenStreetMap XML files are stored in the /user/name/osm HDFS directory.

See Also:

To download OpenStreetMap data, go to

http://www.openstreetmap.org/export
For information about the OpenStreetMap XML format, go to

http://wiki.openstreetmap.org/wiki/OSM_XML

Example 7-7 Creating a Table Over OpenStreetMap XML

This example creates a table over OpenStreetMap XML with one row for each node element as follows:

The id, lat, lon, and user attributes of the node element are mapped to table columns.
The year is extracted from the timestamp attribute and mapped to the YEAR column. If a node does not have a timestamp attribute, then -1 is used for the year.
If the node element has any child tag elements, then they are stored as an XML string in the TAGS column. If node has no child tag elements, then column value is NULL.

hive>
CREATE EXTERNAL TABLE nodes (
   id BIGINT,
   latitude DOUBLE,
   longitude DOUBLE,
   year SMALLINT,
   tags STRING
) 
ROW FORMAT 
   SERDE 'oracle.hadoop.xquery.hive.OXMLSerDe'
STORED AS
   INPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLInputFormat'
   OUTPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLOutputFormat'
LOCATION '/user/name/osm'
TBLPROPERTIES (
   "oxh-elements" = "node",
   "oxh-column.id" = "./@id",
   "oxh-column.latitude" = "./@lat",
   "oxh-column.longitude" = "./@lon",
   "oxh-column.year" = "
      if (fn:exists(./@timestamp)) then
         fn:year-from-dateTime(xs:dateTime(./@timestamp))
      else
         -1
   ",
   "oxh-column.tags" = "
      if (fn:exists(./tag)) then
         fn:serialize(<tags>{./tag}</tags>)
      else 
         ()
   "
);

The following query returns the number of nodes per year:

hive> SELECT year, count(*) FROM nodes GROUP BY year;

This query returns the total number of tags across nodes:

hive> SELECT sum(xml_query_as_int("count(tags/tag)", tags)) FROM nodes;

Example 7-8

In OpenStreetMap XML, the node, way, and relation elements share a set of common attributes, such as the user who contributed the data. The next table produces one row for each node, way, and relation element.

hive>
  CREATE EXTERNAL TABLE osm (
     id BIGINT,
     uid BIGINT,
     type STRING
  ) 
  ROW FORMAT 
    SERDE 'oracle.hadoop.xquery.hive.OXMLSerDe'
  STORED AS
    INPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLInputFormat'
    OUTPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLOutputFormat'
  LOCATION '/user/name/osm'
  TBLPROPERTIES (
    "oxh-elements" = "node, way, relation",
    "oxh-column.id" = "./@id",
    "oxh-column.uid" = "./@uid",
    "oxh-column.type" = "./name()"
  );

The following query returns the number of node, way, and relation elements. The TYPE column is set to the name of the selected element, which is either node, way, or relation.

hive> SELECT type, count(*) FROM osm GROUP BY type;

This query returns the number of distinct user IDs:

hive> SELECT count(*) FROM (SELECT uid FROM osm GROUP BY uid) t;

See Also:

For a description of the OpenStreetMap elements and attributes, go to

http://wiki.openstreetmap.org/wiki/Elements

Oracle XML Functions for Hive Reference

This section describes the Oracle XML Extensions for Hive. It describes the following commands and functions:

Data Type Conversions

This table shows the conversions that occur automatically between Hive primitives and XML schema types.

Table 7-1 Data Type Equivalents

Hive	XML schema
`TINYINT`	`xs:byte`
`SMALLINT`	`xs:short`
`INT`	`xs:int`
`BIGINT`	`xs:long`
`BOOLEAN`	`xs:boolean`
`FLOAT`	`xs:float`
`DOUBLE`	`xs:double`
`STRING`	`xs:string`

Hive Access to External Files

The Hive functions have access to the following external file resources:

XML schemas

See http://www.w3.org/TR/xquery/#id-schema-import
XML documents

See http://www.w3.org/TR/xpath-functions/#func-doc
XQuery library modules

See http://www.w3.org/TR/xquery/#id-module-import

You can address these files by their URI from either HTTP (by using the http://... syntax) or the local file system (by using the file://... syntax). In this example, relative file locations are resolved against the local working directory of the task, so that URIs such as bar.xsd can be used to access files that were added to the distributed cache:

xml_query("
   import schema namespace tns='http://example.org' at 'bar.xsd';
   validate { ... }
        ",
   .
   .
   .

To access a local file, first add it to the Hadoop distributed cache using the Hive ADD FILE command. For example:

ADD FILE /local/mydir/thisfile.xsd;

Otherwise, you must ensure that the file is available on all nodes of the cluster, such as by mounting the same network drive or simply copying the file to every node. The default base URI is set to the local working directory.

See Also:

For examples of accessing the distributed cache, see Example 7-15 for xml_query, Example 7-22 for xml_query_as_primitive, and Example 7-31 for xml_table.
For information about the default base URI, see XQuery 3.1: An XML Query Language at

http://www.w3.org/TR/xquery/#dt-base-uri

Online Documentation of Functions

You can get online Help for the Hive extension functions by using this command:

DESCRIBE FUNCTION [EXTENDED] function_name;

This example provides a brief description of the xml_query function:

hive> describe function xml_query;         
OK
xml_query(query, bindings) - Returns the result of the query as a STRING array

The EXTENDED option provides a detailed description and examples:

hive> describe function extended xml_query;
OK
xml_query(query, bindings) - Returns the result of the query as a STRING array
Evaluates an XQuery expression with the specified bindings. The query argument must be a STRING and the bindings argument must be a STRING or a STRUCT. If the bindings argument is a STRING, it is parsed as XML and bound to the initial context item of the query. For example:
  
  > SELECT xml_query("x/y", "<x><y>hello</y><z/><y>world</y></x>") FROM src LIMIT 1;
  ["hello", "world"]
     .
     .
     .

xml_exists

Tests if the result of a query is empty.

Signature

xml_exists(
    STRING query, 
    { STRING | STRUCT } bindings
) as BOOLEAN

Description

query

An XQuery or XPath expression. It must be a constant value, because it is only read the first time the function is evaluated. The initial query string is compiled and reused in all subsequent calls.

You can access files that are stored in the Hadoop distributed cache and HTTP resources (http://...). Use the XQuery fn:doc function for XML documents, and the fn:unparsed-text and fn:parsed-text-lines functions to access plain text files.

If an error occurs while compiling the query, the function raises an error. If an error occurs while evaluating the query, the error is logged (not raised), and an empty array is returned.

bindings

The input that the query processes. The value can be an XML STRING or a STRUCT of variable values:

STRING: The string is bound to the initial context item of the query as XML.
STRUCT: A STRUCT with an even number of fields. Each pair of fields defines a variable binding (name, value) for the query. The name fields must be type STRING, and the value fields can be any supported primitive. See "Data Type Conversions."

Return Value

true if the result of the query is not empty; false if the result is empty or the query raises a dynamic error

Notes

The first dynamic error raised by a query is logged, but subsequent errors are suppressed.

Examples

Example 7-9 STRING Binding

This example parses and binds the input XML string to the initial context item of the query x/y:

Hive> SELECT xml_exists("x/y", "<x><y>123</y></x>") FROM src LIMIT 1;
     .
     .
     .
true

Example 7-10 STRUCT Binding

This example defines two query variables, $data and $value:

Hive> SELECT xml_exists(
      "parse-xml($data)/x/y[@id = $value]",
      struct(
         "data", "<x><y id='1'/><y id='2'/></x>",
         "value", 2
      )
   ) FROM src LIMIT 1;
     .
     .
     .
true

Example 7-11 Error Logging

In this example, an error is written to the log, because the input XML is invalid:

hive> SELECT xml_exists("x/y", "<x><y>123</invalid></x>") FROM src LIMIT 1;
     .
     .
     .
false

xml_query

Returns the result of a query as an array of STRING values.

Signature

xml_query(
   STRING query, 
   { STRING | STRUCT } bindings
) as ARRAY<STRING>

Description

query

An XQuery or XPath expression. It must be a constant value, because it is only read the first time the function is evaluated. The initial query string is compiled and reused in all subsequent calls.

If an error occurs while compiling the query, the function raises an error. If an error occurs while evaluating the query, the error is logged (not raised), and an empty array is returned.

bindings

The input that the query processes. The value can be an XML STRING or a STRUCT of variable values:

STRING: The string is bound to the initial context item of the query as XML. See Example 7-12.
STRUCT: A STRUCT with an even number of fields. Each pair of fields defines a variable binding (name, value) for the query. The name fields must be type STRING, and the value fields can be any supported primitive. See "Data Type Conversions" and Example 7-13.

Return Value

A Hive array of STRING values, which are the result of the query converted to a sequence of atomic values. If the result of the query is empty, then the return value is an empty array.

Examples

Example 7-12 Using a STRING Binding

This example parses and binds the input XML string to the initial context item of the query x/y:

hive> 
SELECT xml_query("x/y", "<x><y>hello</y><z/><y>world</y></x>") 
FROM src LIMIT 1;
     .
     .
     .
["hello","world"]

Example 7-13 Using a STRUCT Binding

In this example, the second argument is a STRUCT that defines two query variables, $data and $value. The values of the variables in the STRUCT are converted to XML schema types as described in "Data Type Conversions."

hive>
SELECT xml_query(
   "fn:parse-xml($data)/x/y[@id = $value]", 
   struct(
      "data", "<x><y id='1'>hello</y><z/><y id='2'>world</y></x>",
      "value", 1
   )
) FROM src LIMIT 1;
     .
     .
     .
["hello"]

Example 7-14 Obtaining Serialized XML

This example uses the fn:serialize function to return serialized XML:

hive> 
SELECT xml_query(
"for $y in x/y 
return fn:serialize($y)
",
"<x><y>hello</y><z/><y>world</y></x>"
) FROM src LIMIT 1;
     .
     .
     .
["<y>hello</y>","<y>world</y>"]

Example 7-15 Accessing the Hadoop Distributed Cache

This example adds a file named test.xml to the distributed cache, and then queries it using the fn:doc function. The file contains this value:

<x><y>hello</y><z/><y>world</y></x>

hive> ADD FILE test.xml;
Added resource: test.xml
hive> SELECT xml_query("fn:doc('test.xml')/x/y", NULL) FROM src LIMIT 1;
     .
     .
     .
["hello","world"]

Example 7-16 Results of a Failed Query

The next example returns an empty array because the input XML is invalid. The XML parsing error will be written to the log:

hive> SELECT xml_query("x/y", "<x><y>hello</y></invalid") FROM src LIMIT 1;
     .
     .
     .
[]

xml_query_as_primitive

Returns the result of a query as a Hive primitive value. Each Hive primitive data type has a separate function named for it:

xml_query_as_string
xml_query_as_boolean
xml_query_as_tinyint
xml_query_as_smallint
xml_query_as_int
xml_query_as_bigint
xml_query_as_double
xml_query_as_float

Signature

xml_query_as_primitive (
   STRING query,
   {STRUCT | STRING} bindings,
} as primitive

Description

query

An XQuery or XPath expression. It must be a constant value, because it is only read the first time the function is evaluated. The initial query string is compiled and reused in all subsequent calls.

If an error occurs while compiling the query, the function raises an error. If an error occurs while evaluating the query, the error is logged (not raised), and an empty array is returned.

bindings

The input that the query processes. The value can be an XML STRING or a STRUCT of variable values:

STRING: The string is bound to the initial context item of the query as XML. See Example 7-17.
STRUCT: A STRUCT with an even number of fields. Each pair of fields defines a variable binding (name, value) for the query. The name fields must be type STRING, and the value fields can be any supported primitive. See "Data Type Conversions" and Example 7-18.

The first item in the result of the query is cast to the XML schema type that maps to the primitive type of the function. If the query returns multiple items, then all but the first are ignored.

Return Value

A Hive primitive value, which is the first item returned by the query, converted to an atomic value. If the result of the query is empty, then the return value is NULL.

Examples

Example 7-17 Using a STRING Binding

This example parses and binds the input XML string to the initial context item of the query x/y:

hive> SELECT xml_query_as_string("x/y", "<x><y>hello</y></x>") FROM src LIMIT 1;
     .
     .
     .
"hello"

The following are string binding examples that use other primitive functions:

hive> SELECT xml_query_as_int("x/y", "<x><y>123</y></x>") FROM src LIMIT 1;
     .
     .
     .
123
 
hive> SELECT xml_query_as_double("x/y", "<x><y>12.3</y></x>") FROM src LIMIT 1;
     .
     .
     .
12.3
 
hive> SELECT xml_query_as_boolean("x/y", "<x><y>true</y></x>") FROM src LIMIT 1;
     .
     .
     .
true

Example 7-18 Using a STRUCT Binding

hive>
SELECT xml_query_as_string(
   "fn:parse-xml($data)/x/y[@id = $value]", 
   struct(
      "data", "<x><y id='1'>hello</y><z/><y id='2'>world</y></x>",
      "value", 2
   )
) FROM src LIMIT 1;
     .
     .
     .
world

Example 7-19 Returning Multiple Query Results

This example returns only the first item (hello) from the query. The second item (world) is discarded.

hive> SELECT xml_query_as_string("x/y", "<x><y>hello</y><z/><y>world</y></x>") FROM src LIMIT 1;
     .
     .
     .
hello

Example 7-20 Returning Empty Query Results

This example returns NULL because the result of the query is empty:

hive> SELECT xml_query_as_string("x/foo", "<x><y>hello</y><z/><y>world</y></x>") FROM src LIMIT 1;
     .
     .
     .
NULL

Example 7-21 Obtaining Serialized XML

These examples use the fn:serialize function to return complex XML elements as a STRING value:

hive> SELECT xml_query_as_string("fn:serialize(x/y[1])", "<x><y>hello</y><z/><y>world</y></x>") FROM src LIMIT 1;
     .
     .
     .
"<y>hello</y>"

hive> SELECT xml_query_as_string(
   "fn:serialize(<html><head><title>{$desc}</title></head><body>Name: {$name}</body></html>)", 
   struct(
      "desc", "Employee Details",
      "name", "John Doe"
   )
) FROM src LIMIT 1;
...
<html><head><title>Employee Details</title></head><body>Name: John Doe</body></html>

Example 7-22 Accessing the Hadoop Distributed Cache

This example adds a file named test.xml to the distributed cache, and then queries it using the fn:doc function. The file contains this value:

<x><y>hello</y><z/><y>world</y></x>

Hive> ADD FILE test.xml;
Added resource: test.xml
Hive> SELECT xml_query_as_string("fn:doc('test.xml')/x/y[1]", NULL) FROM src LIMIT 1;
     .
     .
     .
hello

Example 7-23 Results of a Failed Query

This example returns NULL because </invalid is missing an angle bracket. An XML parsing error is written to the log:

Hive> SELECT xml_query_as_string("x/y", "<x><y>hello</invalid") FROM src LIMIT 1;
     .
     .
     .
NULL

This example returns NULL because foo cannot be cast as xs:float. A cast error is written to the log:

Hive> SELECT xml_query_as_float("x/y", "<x><y>foo</y></x>") FROM src LIMIT 1;
     .
     .
     .
NULL

xml_table

A user-defined table-generating function (UDTF) that maps an XML value to zero or more table rows. This function enables nested repeating elements in XML to be mapped to Hive table rows.

Signature

xml_table( 
   STRUCT? namespaces,
   STRING query, 
   {STRUCT | STRING} bindings,
   STRUCT? columns
)

Description

namespaces

Identifies the namespaces that the query and column expressions can use. Optional.

The value is a STRUCT with an even number of STRING fields. Each pair of fields defines a namespace binding (prefix, URI) that can be used by the query or the column expressions. See Example 7-26.

query

An XQuery or XPath expression that generates a table row for each returned value. It must be a constant value, because it is only read the first time the function is evaluated. The initial query string is compiled and reused in all subsequent calls.

If a dynamic error occurs during query processing, then the function does not raise an error, but logs it the first time. Subsequent dynamic errors are not logged.

bindings

The input that the query processes. The value can be an XML STRING or a STRUCT of variable values:

STRING: The string is bound to the initial context item of the query as XML. See Example 7-24.
STRUCT: A STRUCT with an even number of fields. Each pair of fields defines a variable binding (name, value) for the query. The name fields must be type STRING, and the value fields can be any supported primitive. See "Data Type Conversions."

columns

The XQuery or XPath expressions that define the columns of the generated rows. Optional.

The value is a STRUCT that contains the additional XQuery expressions. The XQuery expressions must be constant STRING values, because they are only read the first time the function is evaluated. For each column expression in the STRUCT, there is one column in the table.

For each item returned by the query, the column expressions are evaluated with the current item as the initial context item of the expression. The results of the column expressions are converted to STRING values and become the values of the row.

If the result of a column expression is empty or if a dynamic error occurs while evaluating the column expression, then the corresponding column value is NULL. If a column expression returns more than one item, then all but the first are ignored.

Omitting the columns argument is the same as specifying 'struct(".")'. See Example 7-25.

Return Value

One table row for each item returned by the query argument.

Notes

The XML table adapter enables Hive tables to be created over large XML files in HDFS. See "Hive CREATE TABLE Syntax for XML Tables".

Examples

Note:

You could use the xml_query_as_string function to achieve the same result in this example. However, xml_table is more efficient, because a single function call sets all three column values and parses the input XML only once for each row. The xml_query_as_string function requires a separate function call for each of the three columns and reparses the same input XML value each time.

Example 7-24 Using a STRING Binding

The query "x/y" returns two <y> elements, therefore two table rows are generated. Because there are two column expressions ("./z", "./w"), each row has two columns.

hive> SELECT xml_table(
   "x/y",
   "<x>
      <y>
         <z>a</z>
         <w>b</w>
      </y>
      <y>
         <z>c</z>
      </y>
   </x>
   ",
   struct("./z", "./w")
 ) AS (z, w)
 FROM src;
     .
     .
     .
a        b
c        NULL

Example 7-25 Using the Columns Argument

The following two queries are equivalent. The first query explicitly specifies the value of the columns argument:

hive> SELECT xml_table(
      "x/y",
      "<x><y>hello</y><y>world</y></x>",
      struct(".")
   ) AS (y)
   FROM src;
     .
     .
     .
hello
world

The second query omits the columns argument, which defaults to struct("."):

hive> SELECT xml_table(
      "x/y",
      "<x><y>hello</y><y>world</y></x>"
   ) AS (y)
   FROM src;
     .
     .
     .
hello
world

Example 7-26 Using the Namespaces Argument

This example specifies the optional namespaces argument, which identifies an ns prefix and a URI of http://example.org.

hive> SELECT xml_table(
     struct("ns", "http://example.org"),
     "ns:x/ns:y",
     "<x xmlns='http://example.org'><y><z/></y><y><z/><z/></y></x>",
     struct("count(./ns:z)")
  ) AS (y)
  FROM src;
     .
     .
     .
1
2

Example 7-27 Querying a Hive Table of XML Documents

This example queries a table named COMMENTS3, which has a single column named XML_STR of type STRING. It contains these three rows:

hive> SELECT xml_str FROM comments3;

<comment id="12345" user="john" text="It is raining:("/>
<comment id="56789" user="kelly" text="I won the lottery!"><like user="john"/><like user="mike"/></comment>
<comment id="54321" user="mike" text="Happy New Year!"><like user="laura"/></comment>

The following query shows how to extract the user, text, and number of likes from the COMMENTS3 table.

hive> SELECT t.id, t.usr, t.likes
     FROM comments3 LATERAL VIEW xml_table(
        "comment",
        comments.xml_str, 
        struct("./@id", "./@user", "fn:count(./like)")
     ) t AS id, usr, likes;

12345   john    0
56789   kelly   2
54321   mike    1

Example 7-28 Mapping Nested XML Elements to Table Rows

This example shows how to use xml_table to flatten nested, repeating XML elements into table rows. See the previous example for the COMMENTS table.

> SELECT t.i, t.u, t.l
    FROM comments3 LATERAL VIEW xml_table (
       "let $comment := ./comment
        for $like in $comment/like
        return
           <r>
              <id>{$comment/@id/data()}</id>
              <user>{$comment/@user/data()}</user>
              <like>{$like/@user/data()}</like>
           </r>
       ",
       comments.xml_str,
       struct("./id", "./user", "./like")
    ) t AS i, u, l;

56789   kelly   john
56789   kelly   mike
54321   mike    laura

Example 7-29 Mapping Optional Nested XML Elements to Table Rows

This example is a slight modification of the previous example that produces a row even when a comment has no likes. See Example 7-27 for the COMMENTS table.

> SELECT t.i, t.u, t.l
     FROM comments3 LATERAL VIEW xml_table (
        "let $comment := ./comment
         for $like allowing empty in $comment/like
         return 
            <r>
               <id>{$comment/@id/data()}</id>
               <user>{$comment/@user/data()}</user>
               <like>{$like/@user/data()}</like>
            </r>
        ",
        comments.xml_str, 
        struct("./id", "./user", "./like")
     ) t AS i, u, l;

12345   john
56789   kelly   john
56789   kelly   mike
54321   mike    laura

Example 7-30 Creating a New View

You can create views and new tables using xml_table, the same as any table-generating function. This example creates a new view named COMMENTS_LIKES from the COMMENTS table:

hive> CREATE VIEW comments_likes AS 
     SELECT xml_table(
        "comment", 
         comments.xml_str, 
         struct("./@id", "count(./like)")
     ) AS (id, likeCt)
     FROM comments;

This example queries the new view:

> SELECT * FROM comments_likes
     WHERE CAST(likeCt AS INT) != 0;

56789   2
54321   1

Example 7-31 Accessing the Hadoop Distributed Cache

You can access XML documents and text files added to the distributed cache by using the fn:doc and fn:unparsed-text functions.

This example queries a file named test.xml that contains this string:

<x><y>hello</y><z/><y>world</y></x>

hive> ADD FILE test.xml;
Added resource: test.xml
hive> SELECT xml_table("fn:doc('test.xml')/x/y", NULL) AS y FROM src;
     .
     .
     .
hello
world