7 Oracle XML Extensions for Hive
This chapter explains how to use the XML extensions for Apache Hive provided with Oracle XQuery for Hadoop. The chapter contains the following sections:
What are the XML Extensions for Hive?
The XML Extensions for Hive provide XML processing support that enables you to do the following:
-
Query large XML files in HDFS as Hive tables
-
Query XML strings in Hive tables
-
Query XML file resources in the Hadoop distributed cache
-
Efficiently extract atomic values from XML without using expensive DOM parsing
-
Retrieve, generate, and transform complex XML elements
-
Generate multiple table rows from a single XML value
-
Manage missing and dirty data in XML
The XML extensions also support these W3C modern standards:
-
XQuery 3.1
-
XQuery Update Facility 1.0 (transform expressions)
-
XPath 3.1
-
XML Schema 1.0
-
XML Namespaces
The XML extensions have two components:
-
XML InputFormat and SerDe for creating XML tables
See "Create XML Tables."
-
XML function library
See "About the Hive Functions."
Use the Hive Extensions From the Command Line
To enable the Oracle XQuery for Hadoop extensions, use the --auxpath
and -i
arguments when starting Hive:
$ hive --auxpath \
$OXH_HOME/hive/lib/oxh-hive.jar,\
$OXH_HOME/hive/lib/oxh-mapreduce.jar,\
$OXH_HOME/hive/lib/oxquery.jar,\
$OXH_HOME/hive/lib/xqjapi.jar,\
$OXH_HOME/hive/lib/apache-xmlbeans.jar,\
$OXH_HOME/hive/lib/woodstox-core-asl-*.jar,\
$OXH_HOME/hive/lib/stax2-api-*.jar \
-i $OXH_HOME/hive/init.sql
Note:
On the Oracle BigDataLite VM, HIVE_AUX_JARS_PATH
contains the Hive extensions by default and hence specifying --auxpath
is unnecessary.
The first time you use the extensions, verify that they are accessible. The following procedure creates a table named SRC
, loads one row into it, and calls the xml_query
function.
To verify that the extensions are accessible:
-
Log in to a server in the Hadoop cluster where you plan to work.
-
Start the Hive command-line interface (CLI):
$ hive --auxpath \ $OXH_HOME/hive/lib/oxh-hive.jar,\ $OXH_HOME/hive/lib/oxh-mapreduce.jar,\ $OXH_HOME/hive/lib/oxquery.jar,\ $OXH_HOME/hive/lib/xqjapi.jar,\ $OXH_HOME/hive/lib/apache-xmlbeans.jar,\ $OXH_HOME/hive/lib/woodstox-core-asl-*.jar,\ $OXH_HOME/hive/lib/stax2-api-*.jar \ -i $OXH_HOME/hive/init.sql
The
init.sql
file contains theCREATE TEMPORARY FUNCTION
statements that declare the XML functions. -
Call an Oracle XQuery for Hadoop function for Hive. This example calls the
xml_query
function to parse an XML string:hive> SELECT xml_query("x/y", "<x><y>123</y><z>456</z></x>"); . . . ["123"]
If the extensions are accessible, then the query returns ["123"]
, as shown in the example.
Use the Hive Extensions in HiveServer2
HIVE_AUX_JARS_PATH
and to the configuration property hive.aux.jars.path
.$OXH_HOME/hive/lib/woodstox-core-asl-*.jar
$OXH_HOME/hive/lib/apache-xmlbeans.jar
$OXH_HOME/hive/lib/oxh-hive.jar
$OXH_HOME/hive/lib/oxh-mapreduce.jar
$OXH_HOME/hive/lib/stax2-api-*.jar
$OXH_HOME/hive/lib/xqjapi.jar
$OXH_HOME/hive/lib/oxquery.jar
See Also:
-
Permanently Declaring the Hive Functions.
You can use
init.sql
to temporarily declare the functions, as shown in previous examples. You can also permanently declare them.
About the Hive Functions
The Oracle XQuery for Hadoop extensions enable you to query XML strings in Hive tables and XML file resources in the Hadoop distributed cache. These are the functions:
-
xml_query
: Returns the result of a query as an array ofSTRING
values. -
xml_query_as_primitive
: Returns the result of a query as a Hive primitive value. Each Hive primitive data type has a separate function named for it. -
xml_exists
: Tests if the result of a query is empty -
xml_table
: Maps an XML value to zero or more table rows, and enables nested repeating elements in XML to be mapped to Hive table rows.
Permanently Declaring the Hive Functions
In the examples in the previous section, $OXH_HOME/hive/init.sql
is used to temporarily declare the XML extensions for Hive functions. However, as an alternative, you can permanently declare the functions so that init.sql
is not needed. Use the following commands to permanently declare the functions.
CREATE FUNCTION xml_query AS 'oracle.hadoop.xquery.hive.OXMLQueryUDF' ;
CREATE FUNCTION xml_query_as_bigint AS 'oracle.hadoop.xquery.hive.OXMLQueryBigintUDF' ;
CREATE FUNCTION xml_query_as_int AS 'oracle.hadoop.xquery.hive.OXMLQueryIntUDF' ;
CREATE FUNCTION xml_query_as_smallint AS 'oracle.hadoop.xquery.hive.OXMLQuerySmallintUDF' ;
CREATE FUNCTION xml_query_as_tinyint AS 'oracle.hadoop.xquery.hive.OXMLQueryTinyintUDF' ;
CREATE FUNCTION xml_query_as_float AS 'oracle.hadoop.xquery.hive.OXMLQueryFloatUDF' ;
CREATE FUNCTION xml_query_as_double AS 'oracle.hadoop.xquery.hive.OXMLQueryDoubleUDF' ';
CREATE FUNCTION xml_query_as_boolean AS 'oracle.hadoop.xquery.hive.OXMLQueryBooleanUDF' ;
CREATE FUNCTION xml_query_as_string AS 'oracle.hadoop.xquery.hive.OXMLQueryStringUDF' ;
CREATE FUNCTION xml_exists AS 'oracle.hadoop.xquery.hive.OXMLExists' ;
CREATE FUNCTION xml_table AS 'oracle.hadoop.xquery.hive.OXMLTableUDTF' ;
Create XML Tables
This section describes how you can use the Hive CREATE TABLE
statement to create tables over large XML documents.
Hive queries over XML tables scale well, because Oracle XQuery for Hadoop splits up the XML so that the MapReduce framework can process it in parallel.
To support scalable processing and operate in the MapReduce framework, the table adapter scans for elements to use to create table rows. It parses only the elements that it identifies as being part of the table; the rest of the XML is ignored. Thus, the XML table adapter does not perform a true parse of the entire XML document, which imposes limitations on the input XML. Because of these limitations, you can create tables only over XML documents that meet the constraints listed in "XQuery Transformation Requirements." Otherwise, you might get errors or incorrect results.
Hive CREATE TABLE Syntax for XML Tables
The following is the basic syntax of the Hive CREATE TABLE
statement for creating a Hive table over XML files:
CREATE TABLE table_name (columns) ROW FORMAT SERDE 'oracle.hadoop.xquery.hive.OXMLSerDe' STORED AS INPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLInputFormat' OUTPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLOutputFormat' TBLPROPERTIES(configuration)
Parameters
Parameter | Description |
---|---|
columns |
All column types in an XML table must be one of the Hive primitive types given in "Data Type Conversions." |
configuration |
Any of the properties described in "CREATE TABLE Configuration Properties." Separate multiple properties with commas. |
Note:
Inserting data into XML tables is not supported.
CREATE TABLE Configuration Properties
Use these configuration properties in the configuration
parameter of the CREATE TABLE
command.
- oxh-default-namespace
-
Sets the default namespace for expressions in the table definition and for XML parsing. The value is a URI.
This example defines the default namespace:
"oxh-default-namespace" = "http://example.com/foo"
- oxh-charset
-
Specifies the character encoding of the XML files. The supported encodings are UTF-8 (default), ISO-8859-1, and US-ASCII.
All XML files for the table must share the same character encoding. Any encoding declarations in the XML files are ignored.
This example defines the character set:
"oxh-charset" = "ISO-8859-1"
- oxh-column.name
-
Specifies how an element selected by the
oxh-elements
property is mapped to columns in a row. In this property name, replace name with the name of a column in the table. The value can be any XQuery expression. The initial context item of the expression (the "." variable) is bound to the selected element.Check the log files even when a query executes successfully. If a column expression returns no value or raises a dynamic error, the column value is
NULL
. The first time an error occurs, it is logged and query processing continues. Subsequent errors raised by the same column expression are not logged.Any column of the table that does not have a corresponding
oxh-column
property behaves as if the following property is specified:"oxh-column.name" = "(./name | ./@name)[1]"
Thus, the default behavior is to select the first child element or attribute that matches the table column name. See "Syntax Example."
- oxh-elements
-
Identifies the names of elements in the XML that map to rows in the table, in a comma-delimited list. This property must be specified one time. Required.
This example maps each element named
foo
in the XML to a single row in the Hive table:"oxh-elements" = "foo"
The next example maps each element named either
foo
orbar
in the XML to a row in the Hive table:"oxh-elements" = "foo, bar"
- oxh-entity.name
-
Defines a set of entity reference definitions.
In the following example, entity references in the XML are expanded from
&foo;
to "foo value" and from&bar;
to "bar value"."oxh-entity.foo" = "foo value" "oxh-entity.bar" = "bar value"
- oxh-namespace.prefix
-
Defines a namespace binding.
This example binds the prefix
myns
to the namespacehttp://example.org
:"oxh-namespace.myns" = "http://example.org"
You can use this property multiple times to define additional namespaces. The namespace definitions are used when parsing the XML. The
oxh-element
andoxh-column
property values can also reference them.In the following example, only
foo
elements in thehttp://example.org
namespace are mapped to table rows:"oxh-namespace.myns" = "http://example.org", "oxh-elements" = "myns:foo", "oxh-column.bar" = "./myns:bar"
CREATE TABLE Examples
This section includes the following examples:
Syntax Example
This example shows how to map XML elements to column names.
Example 7-1 Basic Column Mappings
In the following table definition, the oxh-elements
property specifies that each element named foo
in the XML is mapped to a single row in the table. The oxh-column
properties specify that a Hive table column named BAR
gets the value of the child element named bar
converted to STRING
, and the column named ZIP
gets the value of the child element named zip
converted to INT
.
CREATE TABLE example (bar STRING, zip INT) ROW FORMAT SERDE 'oracle.hadoop.xquery.hive.OXMLSerDe' STORED AS INPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLInputFormat' OUTPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLOutputFormat' TBLPROPERTIES( "oxh-elements" = "foo", "oxh-column.bar" = "./bar", "oxh-column.zip" = "./zip" )
Example 7-2 Conditional Column Mappings
In this modified definition of the ZIP
column, the column receives a value of -1 if the foo
element does not have a child zip
element, or if the zip
element contains a nonnumeric value:
"oxh-column.zip" = " if (./zip castable as xs:int) then xs:int(./zip) else -1 "
Example 7-3 Default Column Mappings
The following two table definitions are equivalent. Table Definition 2 relies on the default mappings for the BAR
and ZIP
columns.
Table Definition 1
CREATE TABLE example (bar STRING, zip INT) ROW FORMAT SERDE 'oracle.hadoop.xquery.hive.OXMLSerDe' STORED AS INPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLInputFormat' OUTPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLOutputFormat' TBLPROPERTIES( "oxh-elements" = "foo", "oxh-column.bar" = "(./bar | ./@bar)[1]", "oxh-column.zip" = "(./zip | ./@zip)[1]" )
Table Definition 2
CREATE TABLE example (bar STRING, zip INT) ROW FORMAT SERDE 'oracle.hadoop.xquery.hive.OXMLSerDe' STORED AS INPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLInputFormat' OUTPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLOutputFormat' TBLPROPERTIES( "oxh-elements" = "foo" )
Simple Examples
These examples show how to create Hive tables over a small XML document that contains comments posted by users of a fictitious website. Each comment
element in the document (comments.xml
) has one or more like
elements that indicate that the user liked the comment.
<comments> <comment id="12345" user="john" text="It is raining :( "/> <comment id="56789" user="kelly" text="I won the lottery!"> <like user="john"/> <like user="mike"/> </comment> <comment id="54321" user="mike" text="Happy New Year!"> <like user="laura"/> </comment> </comments>
In the CREATE TABLE
examples, the comments.xml
input file is in the current working directory of the local file system.
Example 7-4 Creating a Table
The following Hive CREATE TABLE
command creates a table named COMMENTS
with a row for each comment containing the user names, text, and number of likes:
hive> CREATE TABLE comments (usr STRING, content STRING, likeCt INT) ROW FORMAT SERDE 'oracle.hadoop.xquery.hive.OXMLSerDe' STORED AS INPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLInputFormat' OUTPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLOutputFormat' TBLPROPERTIES( "oxh-elements" = "comment", "oxh-column.usr" = "./@user", "oxh-column.content" = "./@text", "oxh-column.likeCt" = "fn:count(./like)" );
The Hive LOAD DATA
command loads comments.xml
into the COMMENTS
table.
hive> LOAD DATA LOCAL INPATH 'comments.xml' OVERWRITE INTO TABLE comments; ]
The following query shows the content of the COMMENTS
table.
hive> SELECT usr, content, likeCt FROM comments;
.
.
.
john It is raining :( 0
kelly I won the lottery! 2
mike Happy New Year! 1
Example 7-5 Querying an XML Column
This CREATE TABLE
command is like the previous example, except that the like
elements are produced as XML in a STRING
column.
hive> CREATE TABLE comments2 (usr STRING, content STRING, likes STRING) ROW FORMAT SERDE 'oracle.hadoop.xquery.hive.OXMLSerDe' STORED AS INPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLInputFormat' OUTPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLOutputFormat' TBLPROPERTIES( "oxh-elements" = "comment", "oxh-column.usr" = "./@user", "oxh-column.content" = "./@text", "oxh-column.likes" = "fn:serialize(<likes>{./like}</likes>)" );
The Hive LOAD DATA
command loads comments.xml
into the table. See "Simple Examples" for the contents of the file.
hive> LOAD DATA LOCAL INPATH 'comments.xml' OVERWRITE INTO TABLE comments2;
The following query shows the content of the COMMENTS2
table.
hive> SELECT usr, content, likes FROM comments2;
.
.
.
john It is raining :( <likes/>
kelly I won the lottery! <likes><like user="john"/><like user="mike"/></likes>
mike Happy New Year! <likes><like user="laura"/></likes>
The next query extracts the user names from the like
elements:
hive> SELECT usr, t.user FROM comments2 LATERAL VIEW > xml_table("likes/like", comments2.likes, struct("./@user")) t AS user; . . . kelly john kelly mike mike laura
Example 7-6 Generating XML in a Single String Column
This command creates a table named COMMENTS3
with a row for each comment, and produces the XML in a single STRING
column.
hive> CREATE TABLE comments3 (xml STRING) ROW FORMAT SERDE 'oracle.hadoop.xquery.hive.OXMLSerDe' STORED AS INPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLInputFormat' OUTPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLOutputFormat' TBLPROPERTIES( "oxh-elements" = "comment", "oxh-column.xml" = "fn:serialize(.)" );
The Hive LOAD DATA
command loads comments.xml
into the table. See "Simple Examples" for the contents of the file.
hive> LOAD DATA LOCAL INPATH 'comments.xml' OVERWRITE INTO TABLE comments3;
The following query shows the contents of the XML column:
hive> SELECT xml FROM comments3;
.
.
.
<comment id="12345" user="john" text="It is raining :( "/>
<comment id="56789" user="kelly" text="I won the lottery!">
<like user="john"/>
<like user="mike"/>
</comment>
<comment id="54321" user="mike" text="Happy New Year!">
<like user="laura"/>
</comment>
The next query extracts the IDs and converts them to integers:
hive> SELECT xml_query_as_int("comment/@id", xml) FROM comments3;
.
.
.
12345
56789
54321
OpenStreetMap Examples
These examples use data from OpenStreetMap, which provides free map data for the entire world. You can export the data as XML for specific geographic regions or the entire planet. An OpenStreetMap XML document mainly contains a sequence of node
, way
, and relation
elements.
In these examples, the OpenStreetMap XML files are stored in the /user/name/osm
HDFS directory.
See Also:
-
To download OpenStreetMap data, go to
-
For information about the OpenStreetMap XML format, go to
Example 7-7 Creating a Table Over OpenStreetMap XML
This example creates a table over OpenStreetMap XML with one row for each node
element as follows:
-
The
id
,lat
,lon
, anduser
attributes of thenode
element are mapped to table columns. -
The year is extracted from the
timestamp
attribute and mapped to theYEAR
column. If a node does not have atimestamp
attribute, then-1
is used for the year. -
If the
node
element has any childtag
elements, then they are stored as an XML string in theTAGS
column. Ifnode
has no childtag
elements, then column value isNULL
.
hive> CREATE EXTERNAL TABLE nodes ( id BIGINT, latitude DOUBLE, longitude DOUBLE, year SMALLINT, tags STRING ) ROW FORMAT SERDE 'oracle.hadoop.xquery.hive.OXMLSerDe' STORED AS INPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLInputFormat' OUTPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLOutputFormat' LOCATION '/user/name/osm' TBLPROPERTIES ( "oxh-elements" = "node", "oxh-column.id" = "./@id", "oxh-column.latitude" = "./@lat", "oxh-column.longitude" = "./@lon", "oxh-column.year" = " if (fn:exists(./@timestamp)) then fn:year-from-dateTime(xs:dateTime(./@timestamp)) else -1 ", "oxh-column.tags" = " if (fn:exists(./tag)) then fn:serialize(<tags>{./tag}</tags>) else () " );
The following query returns the number of nodes per year:
hive> SELECT year, count(*) FROM nodes GROUP BY year;
This query returns the total number of tags across nodes:
hive> SELECT sum(xml_query_as_int("count(tags/tag)", tags)) FROM nodes;
Example 7-8
In OpenStreetMap XML, the node
, way
, and relation
elements share a set of common attributes, such as the user who contributed the data. The next table produces one row for each node
, way
, and relation
element.
hive> CREATE EXTERNAL TABLE osm ( id BIGINT, uid BIGINT, type STRING ) ROW FORMAT SERDE 'oracle.hadoop.xquery.hive.OXMLSerDe' STORED AS INPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLInputFormat' OUTPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLOutputFormat' LOCATION '/user/name/osm' TBLPROPERTIES ( "oxh-elements" = "node, way, relation", "oxh-column.id" = "./@id", "oxh-column.uid" = "./@uid", "oxh-column.type" = "./name()" );
The following query returns the number of node
, way
, and relation
elements. The TYPE
column is set to the name of the selected element, which is either node
, way
, or relation
.
hive> SELECT type, count(*) FROM osm GROUP BY type;
This query returns the number of distinct user IDs:
hive> SELECT count(*) FROM (SELECT uid FROM osm GROUP BY uid) t;
See Also:
For a description of the OpenStreetMap elements and attributes, go to
Oracle XML Functions for Hive Reference
This section describes the Oracle XML Extensions for Hive. It describes the following commands and functions:
Data Type Conversions
This table shows the conversions that occur automatically between Hive primitives and XML schema types.
Table 7-1 Data Type Equivalents
Hive | XML schema |
---|---|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Hive Access to External Files
The Hive functions have access to the following external file resources:
-
XML schemas
-
XML documents
-
XQuery library modules
You can address these files by their URI from either HTTP (by using the http://...
syntax) or the local file system (by using the file://...
syntax). In this example, relative file locations are resolved against the local working directory of the task, so that URIs such as bar.xsd can be used to access files that were added to the distributed cache:
xml_query(" import schema namespace tns='http://example.org' at 'bar.xsd'; validate { ... } ", . . .
To access a local file, first add it to the Hadoop distributed cache using the Hive ADD FILE
command. For example:
ADD FILE /local/mydir/thisfile.xsd;
Otherwise, you must ensure that the file is available on all nodes of the cluster, such as by mounting the same network drive or simply copying the file to every node. The default base URI is set to the local working directory.
See Also:
-
For examples of accessing the distributed cache, see Example 7-15 for
xml_query
, Example 7-22 forxml_query_as_
primitive
, and Example 7-31 forxml_table
. -
For information about the default base URI, see XQuery 3.1: An XML Query Language at
Online Documentation of Functions
You can get online Help for the Hive extension functions by using this command:
DESCRIBE FUNCTION [EXTENDED] function_name;
This example provides a brief description of the xml_query
function:
hive> describe function xml_query;
OK
xml_query(query, bindings) - Returns the result of the query as a STRING array
The EXTENDED
option provides a detailed description and examples:
hive> describe function extended xml_query;
OK
xml_query(query, bindings) - Returns the result of the query as a STRING array
Evaluates an XQuery expression with the specified bindings. The query argument must be a STRING and the bindings argument must be a STRING or a STRUCT. If the bindings argument is a STRING, it is parsed as XML and bound to the initial context item of the query. For example:
> SELECT xml_query("x/y", "<x><y>hello</y><z/><y>world</y></x>") FROM src LIMIT 1;
["hello", "world"]
.
.
.
xml_exists
Tests if the result of a query is empty.
Signature
xml_exists( STRING query, { STRING | STRUCT } bindings ) as BOOLEAN
Description
- query
-
An XQuery or XPath expression. It must be a constant value, because it is only read the first time the function is evaluated. The initial query string is compiled and reused in all subsequent calls.
You can access files that are stored in the Hadoop distributed cache and HTTP resources (
http://...
). Use the XQueryfn:doc
function for XML documents, and thefn:unparsed-text
andfn:parsed-text-lines
functions to access plain text files.If an error occurs while compiling the query, the function raises an error. If an error occurs while evaluating the query, the error is logged (not raised), and an empty array is returned.
- bindings
-
The input that the query processes. The value can be an XML
STRING
or aSTRUCT
of variable values:-
STRING
: The string is bound to the initial context item of the query as XML. -
STRUCT
: ASTRUCT
with an even number of fields. Each pair of fields defines a variable binding (name, value) for the query. The name fields must be typeSTRING
, and the value fields can be any supported primitive. See "Data Type Conversions."
-
Return Value
true
if the result of the query is not empty; false
if the result is empty or the query raises a dynamic error
Notes
The first dynamic error raised by a query is logged, but subsequent errors are suppressed.
Examples
Example 7-9 STRING Binding
This example parses and binds the input XML string to the initial context item of the query x/y
:
Hive> SELECT xml_exists("x/y", "<x><y>123</y></x>") FROM src LIMIT 1;
.
.
.
true
Example 7-10 STRUCT Binding
This example defines two query variables, $data
and $value
:
Hive> SELECT xml_exists( "parse-xml($data)/x/y[@id = $value]", struct( "data", "<x><y id='1'/><y id='2'/></x>", "value", 2 ) ) FROM src LIMIT 1; . . . true
Example 7-11 Error Logging
In this example, an error is written to the log, because the input XML is invalid:
hive> SELECT xml_exists("x/y", "<x><y>123</invalid></x>") FROM src LIMIT 1;
.
.
.
false
xml_query
Returns the result of a query as an array of STRING
values.
Signature
xml_query( STRING query, { STRING | STRUCT } bindings ) as ARRAY<STRING>
Description
- query
-
An XQuery or XPath expression. It must be a constant value, because it is only read the first time the function is evaluated. The initial query string is compiled and reused in all subsequent calls.
You can access files that are stored in the Hadoop distributed cache and HTTP resources (
http://...
). Use the XQueryfn:doc
function for XML documents, and thefn:unparsed-text
andfn:parsed-text-lines
functions to access plain text files. See Example 7-15.If an error occurs while compiling the query, the function raises an error. If an error occurs while evaluating the query, the error is logged (not raised), and an empty array is returned.
- bindings
-
The input that the query processes. The value can be an XML
STRING
or aSTRUCT
of variable values:-
STRING
: The string is bound to the initial context item of the query as XML. See Example 7-12. -
STRUCT
: ASTRUCT
with an even number of fields. Each pair of fields defines a variable binding (name, value) for the query. The name fields must be typeSTRING
, and the value fields can be any supported primitive. See "Data Type Conversions" and Example 7-13.
-
Return Value
A Hive array of STRING
values, which are the result of the query converted to a sequence of atomic values. If the result of the query is empty, then the return value is an empty array.
Examples
Example 7-12 Using a STRING Binding
This example parses and binds the input XML string to the initial context item of the query x/y
:
hive> SELECT xml_query("x/y", "<x><y>hello</y><z/><y>world</y></x>") FROM src LIMIT 1; . . . ["hello","world"]
Example 7-13 Using a STRUCT Binding
In this example, the second argument is a STRUCT
that defines two query variables, $data
and $value
. The values of the variables in the STRUCT
are converted to XML schema types as described in "Data Type Conversions."
hive> SELECT xml_query( "fn:parse-xml($data)/x/y[@id = $value]", struct( "data", "<x><y id='1'>hello</y><z/><y id='2'>world</y></x>", "value", 1 ) ) FROM src LIMIT 1; . . . ["hello"]
Example 7-14 Obtaining Serialized XML
This example uses the fn:serialize
function to return serialized XML:
hive> SELECT xml_query( "for $y in x/y return fn:serialize($y) ", "<x><y>hello</y><z/><y>world</y></x>" ) FROM src LIMIT 1; . . . ["<y>hello</y>","<y>world</y>"]
Example 7-15 Accessing the Hadoop Distributed Cache
This example adds a file named test.xml to the distributed cache, and then queries it using the fn:doc
function. The file contains this value:
<x><y>hello</y><z/><y>world</y></x>
hive> ADD FILE test.xml; Added resource: test.xml hive> SELECT xml_query("fn:doc('test.xml')/x/y", NULL) FROM src LIMIT 1; . . . ["hello","world"]
Example 7-16 Results of a Failed Query
The next example returns an empty array because the input XML is invalid. The XML parsing error will be written to the log:
hive> SELECT xml_query("x/y", "<x><y>hello</y></invalid") FROM src LIMIT 1;
.
.
.
[]
xml_query_as_primitive
Returns the result of a query as a Hive primitive value. Each Hive primitive data type has a separate function named for it:
-
xml_query_as_string
-
xml_query_as_boolean
-
xml_query_as_tinyint
-
xml_query_as_smallint
-
xml_query_as_int
-
xml_query_as_bigint
-
xml_query_as_double
-
xml_query_as_float
Signature
xml_query_as_primitive ( STRING query, {STRUCT | STRING} bindings, } as primitive
Description
- query
-
An XQuery or XPath expression. It must be a constant value, because it is only read the first time the function is evaluated. The initial query string is compiled and reused in all subsequent calls.
You can access files that are stored in the Hadoop distributed cache and HTTP resources (
http://...
). Use the XQueryfn:doc
function for XML documents, and thefn:unparsed-text
andfn:parsed-text-lines
functions to access plain text files. See Example 7-15.If an error occurs while compiling the query, the function raises an error. If an error occurs while evaluating the query, the error is logged (not raised), and an empty array is returned.
- bindings
-
The input that the query processes. The value can be an XML
STRING
or aSTRUCT
of variable values:-
STRING
: The string is bound to the initial context item of the query as XML. See Example 7-17. -
STRUCT
: ASTRUCT
with an even number of fields. Each pair of fields defines a variable binding (name, value) for the query. The name fields must be typeSTRING
, and the value fields can be any supported primitive. See "Data Type Conversions" and Example 7-18.The first item in the result of the query is cast to the XML schema type that maps to the primitive type of the function. If the query returns multiple items, then all but the first are ignored.
-
Return Value
A Hive primitive value, which is the first item returned by the query, converted to an atomic value. If the result of the query is empty, then the return value is NULL
.
Examples
Example 7-17 Using a STRING Binding
This example parses and binds the input XML string to the initial context item of the query x/y
:
hive> SELECT xml_query_as_string("x/y", "<x><y>hello</y></x>") FROM src LIMIT 1;
.
.
.
"hello"
The following are string binding examples that use other primitive functions:
hive> SELECT xml_query_as_int("x/y", "<x><y>123</y></x>") FROM src LIMIT 1; . . . 123 hive> SELECT xml_query_as_double("x/y", "<x><y>12.3</y></x>") FROM src LIMIT 1; . . . 12.3 hive> SELECT xml_query_as_boolean("x/y", "<x><y>true</y></x>") FROM src LIMIT 1; . . . true
Example 7-18 Using a STRUCT Binding
In this example, the second argument is a STRUCT
that defines two query variables, $data
and $value
. The values of the variables in the STRUCT
are converted to XML schema types as described in "Data Type Conversions."
hive> SELECT xml_query_as_string( "fn:parse-xml($data)/x/y[@id = $value]", struct( "data", "<x><y id='1'>hello</y><z/><y id='2'>world</y></x>", "value", 2 ) ) FROM src LIMIT 1; . . . world
Example 7-19 Returning Multiple Query Results
This example returns only the first item (hello) from the query. The second item (world) is discarded.
hive> SELECT xml_query_as_string("x/y", "<x><y>hello</y><z/><y>world</y></x>") FROM src LIMIT 1;
.
.
.
hello
Example 7-20 Returning Empty Query Results
This example returns NULL
because the result of the query is empty:
hive> SELECT xml_query_as_string("x/foo", "<x><y>hello</y><z/><y>world</y></x>") FROM src LIMIT 1;
.
.
.
NULL
Example 7-21 Obtaining Serialized XML
These examples use the fn:serialize
function to return complex XML elements as a STRING
value:
hive> SELECT xml_query_as_string("fn:serialize(x/y[1])", "<x><y>hello</y><z/><y>world</y></x>") FROM src LIMIT 1; . . . "<y>hello</y>" hive> SELECT xml_query_as_string( "fn:serialize(<html><head><title>{$desc}</title></head><body>Name: {$name}</body></html>)", struct( "desc", "Employee Details", "name", "John Doe" ) ) FROM src LIMIT 1; ... <html><head><title>Employee Details</title></head><body>Name: John Doe</body></html>
Example 7-22 Accessing the Hadoop Distributed Cache
This example adds a file named test.xml
to the distributed cache, and then queries it using the fn:doc
function. The file contains this value:
<x><y>hello</y><z/><y>world</y></x>
Hive> ADD FILE test.xml; Added resource: test.xml Hive> SELECT xml_query_as_string("fn:doc('test.xml')/x/y[1]", NULL) FROM src LIMIT 1; . . . hello
Example 7-23 Results of a Failed Query
This example returns NULL
because </invalid
is missing an angle bracket. An XML parsing error is written to the log:
Hive> SELECT xml_query_as_string("x/y", "<x><y>hello</invalid") FROM src LIMIT 1;
.
.
.
NULL
This example returns NULL
because foo
cannot be cast as xs:float
. A cast error is written to the log:
Hive> SELECT xml_query_as_float("x/y", "<x><y>foo</y></x>") FROM src LIMIT 1;
.
.
.
NULL
xml_table
A user-defined table-generating function (UDTF) that maps an XML value to zero or more table rows. This function enables nested repeating elements in XML to be mapped to Hive table rows.
Signature
xml_table( STRUCT? namespaces, STRING query, {STRUCT | STRING} bindings, STRUCT? columns )
Description
- namespaces
-
Identifies the namespaces that the query and column expressions can use. Optional.
The value is a
STRUCT
with an even number ofSTRING
fields. Each pair of fields defines a namespace binding (prefix, URI) that can be used by the query or the column expressions. See Example 7-26. - query
-
An XQuery or XPath expression that generates a table row for each returned value. It must be a constant value, because it is only read the first time the function is evaluated. The initial query string is compiled and reused in all subsequent calls.
If a dynamic error occurs during query processing, then the function does not raise an error, but logs it the first time. Subsequent dynamic errors are not logged.
- bindings
-
The input that the query processes. The value can be an XML
STRING
or aSTRUCT
of variable values:-
STRING
: The string is bound to the initial context item of the query as XML. See Example 7-24. -
STRUCT
: ASTRUCT
with an even number of fields. Each pair of fields defines a variable binding (name, value) for the query. The name fields must be typeSTRING
, and the value fields can be any supported primitive. See "Data Type Conversions."
-
- columns
-
The XQuery or XPath expressions that define the columns of the generated rows. Optional.
The value is a
STRUCT
that contains the additional XQuery expressions. The XQuery expressions must be constantSTRING
values, because they are only read the first time the function is evaluated. For each column expression in theSTRUCT
, there is one column in the table.For each item returned by the query, the column expressions are evaluated with the current item as the initial context item of the expression. The results of the column expressions are converted to
STRING
values and become the values of the row.If the result of a column expression is empty or if a dynamic error occurs while evaluating the column expression, then the corresponding column value is
NULL
. If a column expression returns more than one item, then all but the first are ignored.Omitting the columns argument is the same as specifying
'struct(".")'
. See Example 7-25.
Return Value
One table row for each item returned by the query argument.
Notes
The XML table adapter enables Hive tables to be created over large XML files in HDFS. See "Hive CREATE TABLE Syntax for XML Tables".
Examples
Note:
You could use the xml_query_as_string
function to achieve the same result in this example. However, xml_table
is more efficient, because a single function call sets all three column values and parses the input XML only once for each row. The xml_query_as_string
function requires a separate function call for each of the three columns and reparses the same input XML value each time.
Example 7-24 Using a STRING Binding
The query "x/y" returns two <y>
elements, therefore two table rows are generated. Because there are two column expressions ("./z", "./w"), each row has two columns.
hive> SELECT xml_table( "x/y", "<x> <y> <z>a</z> <w>b</w> </y> <y> <z>c</z> </y> </x> ", struct("./z", "./w") ) AS (z, w) FROM src; . . . a b c NULL
Example 7-25 Using the Columns Argument
The following two queries are equivalent. The first query explicitly specifies the value of the columns argument:
hive> SELECT xml_table( "x/y", "<x><y>hello</y><y>world</y></x>", struct(".") ) AS (y) FROM src; . . . hello world
The second query omits the columns argument, which defaults to struct(".")
:
hive> SELECT xml_table( "x/y", "<x><y>hello</y><y>world</y></x>" ) AS (y) FROM src; . . . hello world
Example 7-26 Using the Namespaces Argument
This example specifies the optional namespaces argument, which identifies an ns
prefix and a URI of http://example.org
.
hive> SELECT xml_table( struct("ns", "http://example.org"), "ns:x/ns:y", "<x xmlns='http://example.org'><y><z/></y><y><z/><z/></y></x>", struct("count(./ns:z)") ) AS (y) FROM src; . . . 1 2
Example 7-27 Querying a Hive Table of XML Documents
This example queries a table named COMMENTS3
, which has a single column named XML_STR
of type STRING
. It contains these three rows:
hive> SELECT xml_str FROM comments3;
<comment id="12345" user="john" text="It is raining:("/>
<comment id="56789" user="kelly" text="I won the lottery!"><like user="john"/><like user="mike"/></comment>
<comment id="54321" user="mike" text="Happy New Year!"><like user="laura"/></comment>
The following query shows how to extract the user, text, and number of likes from the COMMENTS3
table.
hive> SELECT t.id, t.usr, t.likes FROM comments3 LATERAL VIEW xml_table( "comment", comments.xml_str, struct("./@id", "./@user", "fn:count(./like)") ) t AS id, usr, likes; 12345 john 0 56789 kelly 2 54321 mike 1
Example 7-28 Mapping Nested XML Elements to Table Rows
This example shows how to use xml_table
to flatten nested, repeating XML elements into table rows. See the previous example for the COMMENTS
table.
> SELECT t.i, t.u, t.l FROM comments3 LATERAL VIEW xml_table ( "let $comment := ./comment for $like in $comment/like return <r> <id>{$comment/@id/data()}</id> <user>{$comment/@user/data()}</user> <like>{$like/@user/data()}</like> </r> ", comments.xml_str, struct("./id", "./user", "./like") ) t AS i, u, l; 56789 kelly john 56789 kelly mike 54321 mike laura
Example 7-29 Mapping Optional Nested XML Elements to Table Rows
This example is a slight modification of the previous example that produces a row even when a comment has no likes. See Example 7-27 for the COMMENTS
table.
> SELECT t.i, t.u, t.l FROM comments3 LATERAL VIEW xml_table ( "let $comment := ./comment for $like allowing empty in $comment/like return <r> <id>{$comment/@id/data()}</id> <user>{$comment/@user/data()}</user> <like>{$like/@user/data()}</like> </r> ", comments.xml_str, struct("./id", "./user", "./like") ) t AS i, u, l; 12345 john 56789 kelly john 56789 kelly mike 54321 mike laura
Example 7-30 Creating a New View
You can create views and new tables using xml_table
, the same as any table-generating function. This example creates a new view named COMMENTS_LIKES
from the COMMENTS
table:
hive> CREATE VIEW comments_likes AS SELECT xml_table( "comment", comments.xml_str, struct("./@id", "count(./like)") ) AS (id, likeCt) FROM comments;
This example queries the new view:
> SELECT * FROM comments_likes WHERE CAST(likeCt AS INT) != 0; 56789 2 54321 1
Example 7-31 Accessing the Hadoop Distributed Cache
You can access XML documents and text files added to the distributed cache by using the fn:doc
and fn:unparsed-text
functions.
This example queries a file named test.xml that contains this string:
<x><y>hello</y><z/><y>world</y></x>
hive> ADD FILE test.xml; Added resource: test.xml hive> SELECT xml_table("fn:doc('test.xml')/x/y", NULL) AS y FROM src; . . . hello world