This chapter explains how to use the XML extensions for Apache Hive provided with Oracle XQuery for Hadoop.
The XML Extensions for Hive provide XML processing support that enables you to do the following:
Query large XML files in HDFS as Hive tables
Query XML strings in Hive tables
Query XML file resources in the Hadoop distributed cache
Efficiently extract atomic values from XML without using expensive DOM parsing
Retrieve, generate, and transform complex XML elements
Generate multiple table rows from a single XML value
Manage missing and dirty data in XML
The XML extensions also support these W3C standards:
XQuery 3.1
XQuery Update Facility 1.0 (transform expressions)
XPath 3.1
XML Schema 1.0
XML Namespaces
The XML extensions have two components:
XML InputFormat and SerDe for creating XML tables
See "Creating XML Tables."
XML function library
See "About the Hive Functions."
To enable the Oracle XQuery for Hadoop extensions, use the --auxpath and -i arguments when starting Hive:

$ hive --auxpath \
$OXH_HOME/hive/lib/oxh-hive.jar,\
$OXH_HOME/hive/lib/oxh-mapreduce.jar,\
$OXH_HOME/hive/lib/oxquery.jar,\
$OXH_HOME/hive/lib/xqjapi.jar,\
$OXH_HOME/hive/lib/apache-xmlbeans.jar,\
$OXH_HOME/hive/lib/woodstox-core-asl-4.2.0.jar,\
$OXH_HOME/hive/lib/stax2-api-3.1.1.jar \
-i $OXH_HOME/hive/init.sql
Note:
On the Oracle BigDataLite VM, HIVE_AUX_JARS_PATH contains the Hive extensions by default, so specifying --auxpath is unnecessary.
The first time you use the extensions, verify that they are accessible. The following procedure creates a table named SRC, loads one row into it, and calls the xml_query function.
To verify that the extensions are accessible:
Log in to a server in the Hadoop cluster where you plan to work.
Create a text file named src.txt that contains one line:
$ echo "XXX" > src.txt
Start the Hive command-line interface (CLI):
$ hive --auxpath \
$OXH_HOME/hive/lib/oxh-hive.jar,\
$OXH_HOME/hive/lib/oxh-mapreduce.jar,\
$OXH_HOME/hive/lib/oxquery.jar,\
$OXH_HOME/hive/lib/xqjapi.jar,\
$OXH_HOME/hive/lib/apache-xmlbeans.jar,\
$OXH_HOME/hive/lib/woodstox-core-asl-4.2.0.jar,\
$OXH_HOME/hive/lib/stax2-api-3.1.1.jar \
-i $OXH_HOME/hive/init.sql
The init.sql file contains the CREATE TEMPORARY FUNCTION statements that declare the XML functions.
Create a simple table:
hive> CREATE TABLE src(dummy STRING);
The SRC table is needed only to fulfill a SELECT syntax requirement. It is like the DUAL table in Oracle Database, which is referenced in SELECT statements to test SQL functions.
Load data from src.txt into the table:
hive> LOAD DATA LOCAL INPATH 'src.txt' OVERWRITE INTO TABLE src;
Query the table using Hive SELECT statements:

hive> SELECT * FROM src;
OK
XXX
Call an Oracle XQuery for Hadoop function for Hive. This example calls the xml_query function to parse an XML string:
hive> SELECT xml_query("x/y", "<x><y>123</y><z>456</z></x>") FROM src;
...
["123"]
If the extensions are accessible, then the query returns ["123"], as shown in the example.
The Oracle XQuery for Hadoop extensions enable you to query XML strings in Hive tables and XML file resources in the Hadoop distributed cache. These are the functions:
xml_query: Returns the result of a query as an array of STRING values.

xml_query_as_primitive: Returns the result of a query as a Hive primitive value. Each Hive primitive data type has a separate function named for it.

xml_exists: Tests if the result of a query is empty.

xml_table: Maps an XML value to zero or more table rows, and enables nested repeating elements in XML to be mapped to Hive table rows.
This section describes how you can use the Hive CREATE TABLE statement to create tables over large XML documents.
Hive queries over XML tables scale well, because Oracle XQuery for Hadoop splits up the XML so that the MapReduce framework can process it in parallel.
To support scalable processing and operate in the MapReduce framework, the table adapter scans for elements to use to create table rows. It parses only the elements that it identifies as being part of the table; the rest of the XML is ignored. Thus, the XML table adapter does not perform a true parse of the entire XML document, which imposes limitations on the input XML. Because of these limitations, you can create tables only over XML documents that meet the constraints listed in "XQuery Transformation Requirements." Otherwise, you might get errors or incorrect results.
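The scanning approach described above can be sketched with a streaming parser. The following Python sketch is only an illustration of the idea, not the extension's implementation: it streams the document, materializes only the elements that map to table rows (a hypothetical "comment" row element here), and discards everything else as it goes.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical input; "comment" stands in for the oxh-elements value.
xml = io.StringIO(
    "<comments>"
    "<comment id='1'/><other>ignored</other><comment id='2'/>"
    "</comments>"
)

row_ids = []
for event, elem in ET.iterparse(xml, events=("end",)):
    if elem.tag == "comment":
        # Only row elements are examined; everything else is skipped.
        row_ids.append(elem.get("id"))
    elem.clear()  # free memory; the rest of the XML is not retained

print(row_ids)  # -> ['1', '2']
```

Because no full document tree is ever built, this style of scan is what lets row extraction run on splits of a large file in parallel.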
The following is the basic syntax of the Hive CREATE TABLE statement for creating a Hive table over XML files:

CREATE TABLE table_name (columns)
ROW FORMAT SERDE 'oracle.hadoop.xquery.hive.OXMLSerDe'
STORED AS
   INPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLInputFormat'
   OUTPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLOutputFormat'
TBLPROPERTIES(configuration)
Parameters

Parameter | Description
---|---
columns | All column types in an XML table must be one of the Hive primitive types given in "Data Type Conversions."
configuration | Any of the properties described in "CREATE TABLE Configuration Properties." Separate multiple properties with commas.
Note:
Inserting data into XML tables is not supported.
Use these configuration properties in the configuration parameter of the CREATE TABLE command.
oxh-default-namespace

Sets the default namespace for expressions in the table definition and for XML parsing. The value is a URI.
This example defines the default namespace:
"oxh-default-namespace" = "http://example.com/foo"
oxh-charset

Specifies the character encoding of the XML files. The supported encodings are UTF-8 (default), ISO-8859-1, and US-ASCII.
All XML files for the table must share the same character encoding. Any encoding declarations in the XML files are ignored.
This example defines the character set:
"oxh-charset" = "ISO-8859-1"
oxh-column.name

Specifies how an element selected by the oxh-elements property is mapped to columns in a row. In this property name, replace name with the name of a column in the table. The value can be any XQuery expression. The initial context item of the expression (the "." variable) is bound to the selected element.
Check the log files even when a query executes successfully. If a column expression returns no value or raises a dynamic error, the column value is NULL. The first time an error occurs, it is logged, and query processing continues. Subsequent errors raised by the same column expression are not logged.
Any column of the table that does not have a corresponding oxh-column property behaves as if the following property is specified:

"oxh-column.name" = "(./name | ./@name)[1]"

Thus, the default behavior is to select the first child element or attribute that matches the table column name. See "Syntax Example."
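The default mapping can be modeled outside Hive. The following Python sketch is an illustration under simplifying assumptions, not the adapter's code: it picks the first matching child element or attribute for a column. (Real XPath union semantics follow document order, where an element's attributes precede its children; this simplification checks the attribute first, then child elements.)

```python
import xml.etree.ElementTree as ET

def default_column_value(row_element, column_name):
    """Approximate the default mapping (./name | ./@name)[1]:
    return the matching attribute value if present, otherwise the
    text of the first child element with that name, else None."""
    if column_name in row_element.attrib:
        return row_element.attrib[column_name]
    child = row_element.find(column_name)
    return child.text if child is not None else None

# Hypothetical row element with one attribute and one child element.
foo = ET.fromstring('<foo bar="from-attribute"><zip>12345</zip></foo>')
print(default_column_value(foo, "zip"))      # -> 12345
print(default_column_value(foo, "bar"))      # -> from-attribute
print(default_column_value(foo, "missing"))  # -> None
```

The None result corresponds to the NULL column value described above.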
oxh-elements

Identifies the names of elements in the XML that map to rows in the table, in a comma-delimited list. This property is required and must be specified exactly once.
This example maps each element named foo in the XML to a single row in the Hive table:

"oxh-elements" = "foo"
The next example maps each element named either foo or bar in the XML to a row in the Hive table:

"oxh-elements" = "foo, bar"
oxh-entity.name

Defines a set of entity reference definitions.
In the following example, entity references in the XML are expanded from &foo; to "foo value" and from &bar; to "bar value".

"oxh-entity.foo" = "foo value"
"oxh-entity.bar" = "bar value"
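The effect of these properties can be sketched in Python. The expand_entities helper below is hypothetical (a plain text substitution, not the adapter's parser); values containing XML-special characters would need escaping before substitution.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the "oxh-entity.foo" and "oxh-entity.bar"
# table properties shown above.
entities = {"foo": "foo value", "bar": "bar value"}

def expand_entities(xml_text, entities):
    """Expand custom &name; references before parsing, roughly what
    the oxh-entity.* properties arrange for the XML table."""
    for name, value in entities.items():
        xml_text = xml_text.replace("&" + name + ";", value)
    return xml_text

expanded = expand_entities("<x>&foo; and &bar;</x>", entities)
print(ET.fromstring(expanded).text)  # -> foo value and bar value
```

Without the expansion step, a standard parser would reject &foo; as an undefined entity.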
oxh-namespace.prefix

Defines a namespace binding.
This example binds the prefix myns to the namespace http://example.org:

"oxh-namespace.myns" = "http://example.org"
You can use this property multiple times to define additional namespaces. The namespace definitions are used when parsing the XML. The oxh-elements and oxh-column property values can also reference them.
In the following example, only foo elements in the http://example.org namespace are mapped to table rows:

"oxh-namespace.myns" = "http://example.org",
"oxh-elements" = "myns:foo",
"oxh-column.bar" = "./myns:bar"
This section includes the following examples:
This example shows how to map XML elements to column names.
Example 7-1 Basic Column Mappings
In the following table definition, the oxh-elements property specifies that each element named foo in the XML is mapped to a single row in the table. The oxh-column properties specify that a Hive table column named BAR gets the value of the child element named bar converted to STRING, and the column named ZIP gets the value of the child element named zip converted to INT.
CREATE TABLE example (bar STRING, zip INT)
ROW FORMAT SERDE 'oracle.hadoop.xquery.hive.OXMLSerDe'
STORED AS
   INPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLInputFormat'
   OUTPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLOutputFormat'
TBLPROPERTIES(
   "oxh-elements" = "foo",
   "oxh-column.bar" = "./bar",
   "oxh-column.zip" = "./zip"
)
Example 7-2 Conditional Column Mappings
In this modified definition of the ZIP column, the column receives a value of -1 if the foo element does not have a child zip element, or if the zip element contains a nonnumeric value:
"oxh-column.zip" = "
   if (./zip castable as xs:int) then
      xs:int(./zip)
   else
      -1
"
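The castable-as fallback can be mirrored in Python. The zip_column_value helper below is hypothetical, a sketch of the conditional mapping rather than Oracle's implementation:

```python
def zip_column_value(zip_text):
    """Return the zip value as an int when it is castable, and -1
    when the element is missing (None here) or nonnumeric, mirroring
    the 'castable as xs:int' check in the column expression."""
    try:
        return int(zip_text)
    except (TypeError, ValueError):
        return -1

print(zip_column_value("12345"))  # -> 12345
print(zip_column_value("abcde"))  # -> -1
print(zip_column_value(None))     # -> -1
```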
Example 7-3 Default Column Mappings
The following two table definitions are equivalent. Table Definition 2 relies on the default mappings for the BAR and ZIP columns.
Table Definition 1
CREATE TABLE example (bar STRING, zip INT)
ROW FORMAT SERDE 'oracle.hadoop.xquery.hive.OXMLSerDe'
STORED AS
   INPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLInputFormat'
   OUTPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLOutputFormat'
TBLPROPERTIES(
   "oxh-elements" = "foo",
   "oxh-column.bar" = "(./bar | ./@bar)[1]",
   "oxh-column.zip" = "(./zip | ./@zip)[1]"
)
Table Definition 2
CREATE TABLE example (bar STRING, zip INT)
ROW FORMAT SERDE 'oracle.hadoop.xquery.hive.OXMLSerDe'
STORED AS
   INPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLInputFormat'
   OUTPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLOutputFormat'
TBLPROPERTIES(
   "oxh-elements" = "foo"
)
These examples show how to create Hive tables over a small XML document that contains comments posted by users of a fictitious website. Each comment element in the document (comments.xml) has zero or more like elements that indicate that the user liked the comment.
<comments>
   <comment id="12345" user="john" text="It is raining :( "/>
   <comment id="56789" user="kelly" text="I won the lottery!">
      <like user="john"/>
      <like user="mike"/>
   </comment>
   <comment id="54321" user="mike" text="Happy New Year!">
      <like user="laura"/>
   </comment>
</comments>
In the CREATE TABLE examples, the comments.xml input file is in the current working directory of the local file system.
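Before writing the table definition, it can help to prototype the intended row mapping. This Python sketch (standard library only, not the XML table adapter) derives (user, text, like-count) rows from the sample document above:

```python
import xml.etree.ElementTree as ET

# The comments.xml sample shown above, inlined for the sketch.
COMMENTS_XML = (
    '<comments>'
    '<comment id="12345" user="john" text="It is raining :( "/>'
    '<comment id="56789" user="kelly" text="I won the lottery!">'
    '<like user="john"/><like user="mike"/></comment>'
    '<comment id="54321" user="mike" text="Happy New Year!">'
    '<like user="laura"/></comment>'
    '</comments>'
)

# One row per comment element: user, text, and a count of like children.
rows = [
    (c.get("user"), c.get("text"), len(c.findall("like")))
    for c in ET.fromstring(COMMENTS_XML).iter("comment")
]
for usr, content, like_ct in rows:
    print(usr, content, like_ct)
```

The resulting rows match the output of the COMMENTS table query shown later in this section.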
Example 7-4 Creating a Table
The following Hive CREATE TABLE command creates a table named COMMENTS with a row for each comment containing the user names, text, and number of likes:
hive> CREATE TABLE comments (usr STRING, content STRING, likeCt INT)
ROW FORMAT SERDE 'oracle.hadoop.xquery.hive.OXMLSerDe'
STORED AS
   INPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLInputFormat'
   OUTPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLOutputFormat'
TBLPROPERTIES(
   "oxh-elements" = "comment",
   "oxh-column.usr" = "./@user",
   "oxh-column.content" = "./@text",
   "oxh-column.likeCt" = "fn:count(./like)"
);
The Hive LOAD DATA command loads comments.xml into the COMMENTS table:

hive> LOAD DATA LOCAL INPATH 'comments.xml' OVERWRITE INTO TABLE comments;
The following query shows the content of the COMMENTS table:

hive> SELECT usr, content, likeCt FROM comments;
...
john It is raining :( 0
kelly I won the lottery! 2
mike Happy New Year! 1
Example 7-5 Querying an XML Column
This CREATE TABLE command is like the previous example, except that the like elements are produced as XML in a STRING column.
hive> CREATE TABLE comments2 (usr STRING, content STRING, likes STRING)
ROW FORMAT SERDE 'oracle.hadoop.xquery.hive.OXMLSerDe'
STORED AS
   INPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLInputFormat'
   OUTPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLOutputFormat'
TBLPROPERTIES(
   "oxh-elements" = "comment",
   "oxh-column.usr" = "./@user",
   "oxh-column.content" = "./@text",
   "oxh-column.likes" = "fn:serialize(<likes>{./like}</likes>)"
);
The Hive LOAD DATA command loads comments.xml into the table. See "Simple Examples" for the contents of the file.
hive> LOAD DATA LOCAL INPATH 'comments.xml' OVERWRITE INTO TABLE comments2;
The following query shows the content of the COMMENTS2 table:

hive> SELECT usr, content, likes FROM comments2;
...
john It is raining :( <likes/>
kelly I won the lottery! <likes><like user="john"/><like user="mike"/></likes>
mike Happy New Year! <likes><like user="laura"/></likes>
The next query extracts the user names from the like elements:

hive> SELECT usr, t.user FROM comments2 LATERAL VIEW
    > xml_table("likes/like", comments2.likes, struct("./@user")) t AS user;
...
kelly john
kelly mike
mike laura
Example 7-6 Generating XML in a Single String Column
This command creates a table named COMMENTS3 with a row for each comment, and produces the XML in a single STRING column.
hive> CREATE TABLE comments3 (xml STRING)
ROW FORMAT SERDE 'oracle.hadoop.xquery.hive.OXMLSerDe'
STORED AS
   INPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLInputFormat'
   OUTPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLOutputFormat'
TBLPROPERTIES(
   "oxh-elements" = "comment",
   "oxh-column.xml" = "fn:serialize(.)"
);
The Hive LOAD DATA command loads comments.xml into the table. See "Simple Examples" for the contents of the file.
hive> LOAD DATA LOCAL INPATH 'comments.xml' OVERWRITE INTO TABLE comments3;
The following query shows the contents of the XML column:
hive> SELECT xml FROM comments3;
...
<comment id="12345" user="john" text="It is raining :( "/>
<comment id="56789" user="kelly" text="I won the lottery!">
<like user="john"/>
<like user="mike"/>
</comment>
<comment id="54321" user="mike" text="Happy New Year!">
<like user="laura"/>
</comment>
The next query extracts the IDs and converts them to integers:
hive> SELECT xml_query_as_int("comment/@id", xml) FROM comments3;
...
12345
56789
54321
These examples use data from OpenStreetMap, which provides free map data for the entire world. You can export the data as XML for specific geographic regions or the entire planet. An OpenStreetMap XML document mainly contains a sequence of node, way, and relation elements.
In these examples, the OpenStreetMap XML files are stored in the /user/name/osm HDFS directory.
See Also:
To download OpenStreetMap data, go to
For information about the OpenStreetMap XML format, go to
Example 7-7 Creating a Table Over OpenStreetMap XML
This example creates a table over OpenStreetMap XML with one row for each node element, as follows:
The id, lat, lon, and user attributes of the node element are mapped to table columns.
The year is extracted from the timestamp attribute and mapped to the YEAR column. If a node does not have a timestamp attribute, then -1 is used for the year.
If the node element has any child tag elements, then they are stored as an XML string in the TAGS column. If node has no child tag elements, then the column value is NULL.
hive> CREATE EXTERNAL TABLE nodes (
   id BIGINT,
   latitude DOUBLE,
   longitude DOUBLE,
   year SMALLINT,
   tags STRING
)
ROW FORMAT SERDE 'oracle.hadoop.xquery.hive.OXMLSerDe'
STORED AS
   INPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLInputFormat'
   OUTPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLOutputFormat'
LOCATION '/user/name/osm'
TBLPROPERTIES (
   "oxh-elements" = "node",
   "oxh-column.id" = "./@id",
   "oxh-column.latitude" = "./@lat",
   "oxh-column.longitude" = "./@lon",
   "oxh-column.year" = "
       if (fn:exists(./@timestamp)) then
          fn:year-from-dateTime(xs:dateTime(./@timestamp))
       else
          -1
   ",
   "oxh-column.tags" = "
       if (fn:exists(./tag)) then
          fn:serialize(<tags>{./tag}</tags>)
       else
          ()
   "
);
The following query returns the number of nodes per year:
hive> SELECT year, count(*) FROM nodes GROUP BY year;
This query returns the total number of tags across nodes:
hive> SELECT sum(xml_query_as_int("count(tags/tag)", tags)) FROM nodes;
Example 7-8
In OpenStreetMap XML, the node, way, and relation elements share a set of common attributes, such as the user who contributed the data. The next table produces one row for each node, way, and relation element.
hive> CREATE EXTERNAL TABLE osm (
   id BIGINT,
   uid BIGINT,
   type STRING
)
ROW FORMAT SERDE 'oracle.hadoop.xquery.hive.OXMLSerDe'
STORED AS
   INPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLInputFormat'
   OUTPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLOutputFormat'
LOCATION '/user/name/osm'
TBLPROPERTIES (
   "oxh-elements" = "node, way, relation",
   "oxh-column.id" = "./@id",
   "oxh-column.uid" = "./@uid",
   "oxh-column.type" = "./name()"
);
The following query returns the number of node, way, and relation elements. The TYPE column is set to the name of the selected element, which is either node, way, or relation.
hive> SELECT type, count(*) FROM osm GROUP BY type;
This query returns the number of distinct user IDs:
hive> SELECT count(*) FROM (SELECT uid FROM osm GROUP BY uid) t;
See Also:
For a description of the OpenStreetMap elements and attributes, go to
This section describes the commands and functions in the Oracle XML Extensions for Hive.
This table shows the conversions that occur automatically between Hive primitives and XML schema types.
Table 7-1 Data Type Equivalents
Hive | XML schema
---|---
STRING | xs:string
BOOLEAN | xs:boolean
TINYINT | xs:byte
SMALLINT | xs:short
INT | xs:int
BIGINT | xs:long
DOUBLE | xs:double
FLOAT | xs:float
The Hive functions have access to the following external file resources:
XML schemas
XML documents
XQuery library modules
You can address these files by their URI from either HTTP (by using the http://... syntax) or the local file system (by using the file://... syntax). In this example, relative file locations are resolved against the local working directory of the task, so that URIs such as bar.xsd can be used to access files that were added to the distributed cache:
xml_query("
   import schema namespace tns='http://example.org' at 'bar.xsd';
   validate { ... }
   ",
   ...
To access a local file, first add it to the Hadoop distributed cache using the Hive ADD FILE command. For example:
ADD FILE /local/mydir/thisfile.xsd;
Otherwise, you must ensure that the file is available on all nodes of the cluster, such as by mounting the same network drive or simply copying the file to every node. The default base URI is set to the local working directory.
See Also:
For examples of accessing the distributed cache, see Example 7-15 for xml_query, Example 7-22 for xml_query_as_primitive, and Example 7-31 for xml_table.
For information about the default base URI, see XQuery 3.1: An XML Query Language at
You can get online Help for the Hive extension functions by using this command:
DESCRIBE FUNCTION [EXTENDED] function_name;
This example provides a brief description of the xml_query function:
hive> describe function xml_query;
OK
xml_query(query, bindings) - Returns the result of the query as a STRING array
The EXTENDED option provides a detailed description and examples:
hive> describe function extended xml_query;
OK
xml_query(query, bindings) - Returns the result of the query as a STRING array
Evaluates an XQuery expression with the specified bindings. The query argument must be a STRING and the bindings argument must be a STRING or a STRUCT. If the bindings argument is a STRING, it is parsed as XML and bound to the initial context item of the query. For example:
> SELECT xml_query("x/y", "<x><y>hello</y><z/><y>world</y></x>") FROM src LIMIT 1;
["hello", "world"]
.
.
.
xml_exists

Tests if the result of a query is empty.
Signature
xml_exists( STRING query, { STRING | STRUCT } bindings ) as BOOLEAN
Description
query

An XQuery or XPath expression. It must be a constant value, because it is only read the first time the function is evaluated. The initial query string is compiled and reused in all subsequent calls.
You can access files that are stored in the Hadoop distributed cache and HTTP resources (http://...). Use the XQuery fn:doc function for XML documents, and the fn:unparsed-text and fn:unparsed-text-lines functions to access plain text files.
If an error occurs while compiling the query, the function raises an error. If an error occurs while evaluating the query, the error is logged (not raised), and the function returns false.
bindings

The input that the query processes. The value can be an XML STRING or a STRUCT of variable values:

STRING: The string is bound to the initial context item of the query as XML.

STRUCT: A STRUCT with an even number of fields. Each pair of fields defines a variable binding (name, value) for the query. The name fields must be type STRING, and the value fields can be any supported primitive. See "Data Type Conversions."
Return Value
true if the result of the query is not empty; false if the result is empty or the query raises a dynamic error.
Notes
The first dynamic error raised by a query is logged, but subsequent errors are suppressed.
Examples
Example 7-9 STRING Binding
This example parses and binds the input XML string to the initial context item of the query x/y:
hive> SELECT xml_exists("x/y", "<x><y>123</y></x>") FROM src LIMIT 1;
...
true
Example 7-10 STRUCT Binding
This example defines two query variables, $data and $value:
hive> SELECT xml_exists(
         "parse-xml($data)/x/y[@id = $value]",
         struct(
            "data", "<x><y id='1'/><y id='2'/></x>",
            "value", 2
         )
      ) FROM src LIMIT 1;
...
true
Example 7-11 Error Logging
In this example, an error is written to the log, because the input XML is invalid:
hive> SELECT xml_exists("x/y", "<x><y>123</invalid></x>") FROM src LIMIT 1;
...
false
xml_query

Returns the result of a query as an array of STRING values.
Signature
xml_query( STRING query, { STRING | STRUCT } bindings ) as ARRAY<STRING>
Description
query

An XQuery or XPath expression. It must be a constant value, because it is only read the first time the function is evaluated. The initial query string is compiled and reused in all subsequent calls.
You can access files that are stored in the Hadoop distributed cache and HTTP resources (http://...). Use the XQuery fn:doc function for XML documents, and the fn:unparsed-text and fn:unparsed-text-lines functions to access plain text files. See Example 7-15.
If an error occurs while compiling the query, the function raises an error. If an error occurs while evaluating the query, the error is logged (not raised), and an empty array is returned.
bindings

The input that the query processes. The value can be an XML STRING or a STRUCT of variable values:

STRING: The string is bound to the initial context item of the query as XML. See Example 7-12.

STRUCT: A STRUCT with an even number of fields. Each pair of fields defines a variable binding (name, value) for the query. The name fields must be type STRING, and the value fields can be any supported primitive. See "Data Type Conversions" and Example 7-13.
Return Value
A Hive array of STRING values, which are the result of the query converted to a sequence of atomic values. If the result of the query is empty, then the return value is an empty array.
Examples
Example 7-12 Using a STRING Binding
This example parses and binds the input XML string to the initial context item of the query x/y:

hive> SELECT xml_query("x/y", "<x><y>hello</y><z/><y>world</y></x>") FROM src LIMIT 1;
...
["hello","world"]
Example 7-13 Using a STRUCT Binding
In this example, the second argument is a STRUCT that defines two query variables, $data and $value. The values of the variables in the STRUCT are converted to XML schema types as described in "Data Type Conversions."
hive> SELECT xml_query(
         "fn:parse-xml($data)/x/y[@id = $value]",
         struct(
            "data", "<x><y id='1'>hello</y><z/><y id='2'>world</y></x>",
            "value", 1
         )
      ) FROM src LIMIT 1;
...
["hello"]
Example 7-14 Obtaining Serialized XML
This example uses the fn:serialize function to return serialized XML:

hive> SELECT xml_query(
         "for $y in x/y
          return fn:serialize($y)
         ",
         "<x><y>hello</y><z/><y>world</y></x>"
      ) FROM src LIMIT 1;
...
["<y>hello</y>","<y>world</y>"]
Example 7-15 Accessing the Hadoop Distributed Cache
This example adds a file named test.xml to the distributed cache, and then queries it using the fn:doc function. The file contains this value:
<x><y>hello</y><z/><y>world</y></x>
hive> ADD FILE test.xml;
Added resource: test.xml
hive> SELECT xml_query("fn:doc('test.xml')/x/y", NULL) FROM src LIMIT 1;
...
["hello","world"]
Example 7-16 Results of a Failed Query
The next example returns an empty array because the input XML is invalid. The XML parsing error will be written to the log:
hive> SELECT xml_query("x/y", "<x><y>hello</y></invalid") FROM src LIMIT 1;
...
[]
xml_query_as_primitive

Returns the result of a query as a Hive primitive value. Each Hive primitive data type has a separate function named for it:
xml_query_as_string
xml_query_as_boolean
xml_query_as_tinyint
xml_query_as_smallint
xml_query_as_int
xml_query_as_bigint
xml_query_as_double
xml_query_as_float
Signature
xml_query_as_primitive ( STRING query, { STRING | STRUCT } bindings ) as primitive
Description
query

An XQuery or XPath expression. It must be a constant value, because it is only read the first time the function is evaluated. The initial query string is compiled and reused in all subsequent calls.
You can access files that are stored in the Hadoop distributed cache and HTTP resources (http://...). Use the XQuery fn:doc function for XML documents, and the fn:unparsed-text and fn:unparsed-text-lines functions to access plain text files. See Example 7-15.
If an error occurs while compiling the query, the function raises an error. If an error occurs while evaluating the query, the error is logged (not raised), and NULL is returned.
bindings

The input that the query processes. The value can be an XML STRING or a STRUCT of variable values:

STRING: The string is bound to the initial context item of the query as XML. See Example 7-17.

STRUCT: A STRUCT with an even number of fields. Each pair of fields defines a variable binding (name, value) for the query. The name fields must be type STRING, and the value fields can be any supported primitive. See "Data Type Conversions" and Example 7-18.
The first item in the result of the query is cast to the XML schema type that maps to the primitive type of the function. If the query returns multiple items, then all but the first are ignored.
Return Value
A Hive primitive value, which is the first item returned by the query, converted to an atomic value. If the result of the query is empty, then the return value is NULL.
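The first-item-and-cast behavior can be modeled in miniature. The xml_query_as_int_sketch helper below is hypothetical (a simple path lookup, not full XQuery) and illustrates the NULL-on-empty and NULL-on-cast-error rules:

```python
import xml.etree.ElementTree as ET

def xml_query_as_int_sketch(path, xml_text):
    """Rough model of xml_query_as_int: evaluate the path, keep only
    the first item, cast it to int; an empty result, a parse error,
    or a cast error yields None (Hive NULL)."""
    try:
        items = ET.fromstring(xml_text).findall(path)
    except ET.ParseError:
        return None                  # parse errors are logged, NULL returned
    if not items:
        return None                  # empty result -> NULL
    try:
        return int(items[0].text)    # only the first item is used
    except (TypeError, ValueError):
        return None                  # cast failure -> NULL

print(xml_query_as_int_sketch("y", "<x><y>123</y><y>456</y></x>"))  # -> 123
print(xml_query_as_int_sketch("y", "<x><y>foo</y></x>"))            # -> None
```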
Examples
Example 7-17 Using a STRING Binding
This example parses and binds the input XML string to the initial context item of the query x/y:
hive> SELECT xml_query_as_string("x/y", "<x><y>hello</y></x>") FROM src LIMIT 1;
...
"hello"
The following are string binding examples that use other primitive functions:
hive> SELECT xml_query_as_int("x/y", "<x><y>123</y></x>") FROM src LIMIT 1;
...
123

hive> SELECT xml_query_as_double("x/y", "<x><y>12.3</y></x>") FROM src LIMIT 1;
...
12.3

hive> SELECT xml_query_as_boolean("x/y", "<x><y>true</y></x>") FROM src LIMIT 1;
...
true
Example 7-18 Using a STRUCT Binding
In this example, the second argument is a STRUCT that defines two query variables, $data and $value. The values of the variables in the STRUCT are converted to XML schema types as described in "Data Type Conversions."
hive> SELECT xml_query_as_string(
         "fn:parse-xml($data)/x/y[@id = $value]",
         struct(
            "data", "<x><y id='1'>hello</y><z/><y id='2'>world</y></x>",
            "value", 2
         )
      ) FROM src LIMIT 1;
...
world
Example 7-19 Returning Multiple Query Results
This example returns only the first item (hello) from the query. The second item (world) is discarded.
hive> SELECT xml_query_as_string("x/y", "<x><y>hello</y><z/><y>world</y></x>") FROM src LIMIT 1;
...
hello
Example 7-20 Returning Empty Query Results
This example returns NULL because the result of the query is empty:
hive> SELECT xml_query_as_string("x/foo", "<x><y>hello</y><z/><y>world</y></x>") FROM src LIMIT 1;
...
NULL
Example 7-21 Obtaining Serialized XML
These examples use the fn:serialize function to return complex XML elements as a STRING value:
hive> SELECT xml_query_as_string("fn:serialize(x/y[1])", "<x><y>hello</y><z/><y>world</y></x>") FROM src LIMIT 1;
...
"<y>hello</y>"

hive> SELECT xml_query_as_string(
         "fn:serialize(<html><head><title>{$desc}</title></head><body>Name: {$name}</body></html>)",
         struct(
            "desc", "Employee Details",
            "name", "John Doe"
         )
      ) FROM src LIMIT 1;
...
<html><head><title>Employee Details</title></head><body>Name: John Doe</body></html>
Example 7-22 Accessing the Hadoop Distributed Cache
This example adds a file named test.xml to the distributed cache, and then queries it using the fn:doc function. The file contains this value:
<x><y>hello</y><z/><y>world</y></x>
hive> ADD FILE test.xml;
Added resource: test.xml
hive> SELECT xml_query_as_string("fn:doc('test.xml')/x/y[1]", NULL) FROM src LIMIT 1;
...
hello
Example 7-23 Results of a Failed Query
This example returns NULL because </invalid is missing an angle bracket. An XML parsing error is written to the log:
hive> SELECT xml_query_as_string("x/y", "<x><y>hello</invalid") FROM src LIMIT 1;
...
NULL
This example returns NULL because foo cannot be cast as xs:float. A cast error is written to the log:
hive> SELECT xml_query_as_float("x/y", "<x><y>foo</y></x>") FROM src LIMIT 1;
...
NULL
xml_table

A user-defined table-generating function (UDTF) that maps an XML value to zero or more table rows. This function enables nested repeating elements in XML to be mapped to Hive table rows.
Signature
xml_table( STRUCT? namespaces, STRING query, {STRUCT | STRING} bindings, STRUCT? columns )
Description
namespaces

Identifies the namespaces that the query and column expressions can use. Optional.
The value is a STRUCT with an even number of STRING fields. Each pair of fields defines a namespace binding (prefix, URI) that can be used by the query or the column expressions. See Example 7-26.
query

An XQuery or XPath expression that generates a table row for each returned value. It must be a constant value, because it is only read the first time the function is evaluated. The initial query string is compiled and reused in all subsequent calls.
If a dynamic error occurs during query processing, then the function does not raise an error, but logs it the first time. Subsequent dynamic errors are not logged.
bindings

The input that the query processes. The value can be an XML STRING or a STRUCT of variable values:

STRING: The string is bound to the initial context item of the query as XML. See Example 7-24.

STRUCT: A STRUCT with an even number of fields. Each pair of fields defines a variable binding (name, value) for the query. The name fields must be type STRING, and the value fields can be any supported primitive. See "Data Type Conversions."
columns

The XQuery or XPath expressions that define the columns of the generated rows. Optional.
The value is a STRUCT that contains the additional XQuery expressions. The XQuery expressions must be constant STRING values, because they are only read the first time the function is evaluated. For each column expression in the STRUCT, there is one column in the table.
For each item returned by the query, the column expressions are evaluated with the current item as the initial context item of the expression. The results of the column expressions are converted to STRING values and become the values of the row.
If the result of a column expression is empty or if a dynamic error occurs while evaluating the column expression, then the corresponding column value is NULL. If a column expression returns more than one item, then all but the first are ignored.
Omitting the columns argument is the same as specifying 'struct(".")'. See Example 7-25.
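The row and column semantics described above can be sketched in Python. The xml_table_sketch helper below is hypothetical and supports only simple child-element paths, unlike the real function's XQuery support:

```python
import xml.etree.ElementTree as ET

def xml_table_sketch(row_path, xml_text, column_paths):
    """Rough model of xml_table: one output row per node matched by
    row_path; each column path is evaluated relative to that node,
    an empty result becomes None (Hive NULL), and only the first
    matched item is kept."""
    rows = []
    for node in ET.fromstring(xml_text).findall(row_path):
        row = []
        for col in column_paths:
            hits = node.findall(col)
            row.append(hits[0].text if hits else None)
        rows.append(tuple(row))
    return rows

xml = "<x><y><z>a</z><w>b</w></y><y><z>c</z></y></x>"
print(xml_table_sketch("y", xml, ["z", "w"]))  # -> [('a', 'b'), ('c', None)]
```

The second row's None corresponds to the NULL produced when a column expression matches nothing.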
Return Value
One table row for each item returned by the query argument.
Notes
The XML table adapter enables Hive tables to be created over large XML files in HDFS. See "Hive CREATE TABLE Syntax for XML Tables".
Examples
Note:
You could use the xml_query_as_string function to achieve the same result as Example 7-27. However, xml_table is more efficient, because a single function call sets all three column values and parses the input XML only once for each row. The xml_query_as_string function requires a separate function call for each of the three columns and reparses the same input XML value each time.
Example 7-24 Using a STRING Binding
The query "x/y" returns two <y> elements; therefore, two table rows are generated. Because there are two column expressions ("./z" and "./w"), each row has two columns.
hive> SELECT xml_table(
          "x/y",
          "<x>
             <y>
                <z>a</z>
                <w>b</w>
             </y>
             <y>
                <z>c</z>
             </y>
           </x>",
          struct("./z", "./w")
       ) AS (z, w)
      FROM src;
   .
   .
   .
a      b
c      NULL
Example 7-25 Using the Columns Argument
The following two queries are equivalent. The first query explicitly specifies the value of the columns argument:
hive> SELECT xml_table(
          "x/y",
          "<x><y>hello</y><y>world</y></x>",
          struct(".")
       ) AS (y)
      FROM src;
   .
   .
   .
hello
world
The second query omits the columns argument, which defaults to struct("."):
hive> SELECT xml_table(
          "x/y",
          "<x><y>hello</y><y>world</y></x>"
       ) AS (y)
      FROM src;
   .
   .
   .
hello
world
Example 7-26 Using the Namespaces Argument
This example specifies the optional namespaces argument, which identifies an ns prefix and a URI of http://example.org.
hive> SELECT xml_table(
          struct("ns", "http://example.org"),
          "ns:x/ns:y",
          "<x xmlns='http://example.org'><y><z/></y><y><z/><z/></y></x>",
          struct("count(./ns:z)")
       ) AS (y)
      FROM src;
   .
   .
   .
1
2
Example 7-27 Querying a Hive Table of XML Documents
This example queries a table named COMMENTS3, which has a single column named XML_STR of type STRING. It contains these three rows:
hive> SELECT xml_str FROM comments3;
<comment id="12345" user="john" text="It is raining:("/>
<comment id="56789" user="kelly" text="I won the lottery!"><like user="john"/><like user="mike"/></comment>
<comment id="54321" user="mike" text="Happy New Year!"><like user="laura"/></comment>
The following query shows how to extract the ID, user, and number of likes from the COMMENTS3 table.
hive> SELECT t.id, t.usr, t.likes
      FROM comments3
      LATERAL VIEW xml_table(
         "comment",
         comments3.xml_str,
         struct("./@id", "./@user", "fn:count(./like)")
      ) t AS id, usr, likes;

12345   john    0
56789   kelly   2
54321   mike    1
Example 7-28 Mapping Nested XML Elements to Table Rows
This example shows how to use xml_table to flatten nested, repeating XML elements into table rows. See Example 7-27 for the COMMENTS3 table.
> SELECT t.i, t.u, t.l
  FROM comments3
  LATERAL VIEW xml_table(
     "let $comment := ./comment
      for $like in $comment/like
      return
         <r>
            <id>{$comment/@id/data()}</id>
            <user>{$comment/@user/data()}</user>
            <like>{$like/@user/data()}</like>
         </r>
     ",
     comments3.xml_str,
     struct("./id", "./user", "./like")
  ) t AS i, u, l;

56789   kelly   john
56789   kelly   mike
54321   mike    laura
Example 7-29 Mapping Optional Nested XML Elements to Table Rows
This example is a slight modification of the previous example that produces a row even when a comment has no likes. The for clause uses allowing empty so that $like can bind to the empty sequence. See Example 7-27 for the COMMENTS3 table.
> SELECT t.i, t.u, t.l
  FROM comments3
  LATERAL VIEW xml_table(
     "let $comment := ./comment
      for $like allowing empty in $comment/like
      return
         <r>
            <id>{$comment/@id/data()}</id>
            <user>{$comment/@user/data()}</user>
            <like>{$like/@user/data()}</like>
         </r>
     ",
     comments3.xml_str,
     struct("./id", "./user", "./like")
  ) t AS i, u, l;

12345   john
56789   kelly   john
56789   kelly   mike
54321   mike    laura
Example 7-30 Creating a New View
You can create views and new tables using xml_table, the same as with any other table-generating function. This example creates a new view named COMMENTS_LIKES from the COMMENTS3 table:
hive> CREATE VIEW comments_likes AS
      SELECT xml_table(
         "comment",
         comments3.xml_str,
         struct("./@id", "count(./like)")
      ) AS (id, likeCt)
      FROM comments3;
This example queries the new view:
> SELECT * FROM comments_likes
  WHERE CAST(likeCt AS INT) != 0;

56789   2
54321   1
Example 7-31 Accessing the Hadoop Distributed Cache
You can access XML documents and text files added to the distributed cache by using the fn:doc and fn:unparsed-text functions.
This example queries a file named test.xml that contains this string:
<x><y>hello</y><z/><y>world</y></x>
hive> ADD FILE test.xml;
Added resource: test.xml

hive> SELECT xml_table("fn:doc('test.xml')/x/y", NULL) AS y FROM src;
   .
   .
   .
hello
world