A user-defined table-generating function (UDTF) that maps an XML value to zero or more table rows. This function enables nested repeating elements in XML to be mapped to Hive table rows.
xml_table( STRUCT? namespaces, STRING query, {STRUCT | STRING} bindings, STRUCT? columns )
Identifies the namespaces that the query and column expressions can use. Optional.
The value is a STRUCT
with an even number of STRING
fields. Each pair of fields defines a namespace binding (prefix, URI) that can be used by the query or the column expressions. See Example 3.
An XQuery or XPath expression that generates a table row for each returned value. It must be a constant value, because it is only read the first time the function is evaluated. The initial query string is compiled and reused in all subsequent calls.
If a dynamic error occurs during query processing, then the function does not raise an error, but logs it the first time. Subsequent dynamic errors are not logged.
The input that the query processes. The value can be an XML STRING
or a STRUCT
of variable values:
STRING
: The string is bound to the initial context item of the query as XML. See Example 1.
STRUCT
: A STRUCT
with an even number of fields. Each pair of fields defines a variable binding (name, value) for the query. The name fields must be type STRING
, and the value fields can be any supported primitive. See "Data Type Conversions."
The XQuery or XPath expressions that define the columns of the generated rows. Optional.
The value is a STRUCT
that contains the additional XQuery expressions. The XQuery expressions must be constant STRING
values, because they are only read the first time the function is evaluated. For each column expression in the STRUCT
, there is one column in the table.
For each item returned by the query, the column expressions are evaluated with the current item as the initial context item of the expression. The results of the column expressions are converted to STRING
values and become the values of the row.
If the result of a column expression is empty or if a dynamic error occurs while evaluating the column expression, then the corresponding column value is NULL
. If a column expression returns more than one item, then all but the first are ignored.
Omitting the columns argument is the same as specifying 'struct(".")'
. See Example 2.
The XML table adapter enables Hive tables to be created over large XML files in HDFS. See "Hive CREATE TABLE Syntax for XML Tables".
The query "x/y" returns two <y>
elements, therefore two table rows are generated. Because there are two column expressions ("./z", "./w"), each row has two columns.
hive> SELECT xml_table( "x/y", "<x> <y> <z>a</z> <w>b</w> </y> <y> <z>c</z> </y> </x> ", struct("./z", "./w") ) AS (z, w) FROM src; . . . a b c NULL
The following two queries are equivalent. The first query explicitly specifies the value of the columns argument:
hive> SELECT xml_table( "x/y", "<x><y>hello</y><y>world</y></x>", struct(".") ) AS (y) FROM src; . . . hello world
The second query omits the columns argument, which defaults to struct(".")
:
hive> SELECT xml_table( "x/y", "<x><y>hello</y><y>world</y></x>" ) AS (y) FROM src; . . . hello world
This example specifies the optional namespaces argument, which identifies an ns
prefix and a URI of http://example.org
.
hive> SELECT xml_table( struct("ns", "http://example.org"), "ns:x/ns:y", "<x xmlns='http://example.org'><y><z/></y><y><z/><z/></y></x>", struct("count(./ns:z)") ) AS (y) FROM src; . . . 1 2
This example queries a table named COMMENTS
, which has a single column named XML_STR
of type STRING
. It contains these three rows:
hive> SELECT xml_str FROM comments3;
<comment id="12345" user="john" text="It is raining:("/>
<comment id="56789" user="kelly" text="I won the lottery!"><like user="john"/><like user="mike"/></comment>
<comment id="54321" user="mike" text="Happy New Year!"><like user="laura"/></comment>
The following query shows how to extract the user, text, and number of likes from the COMMENTS
table.
hive> SELECT t.id, t.usr, t.likes FROM comments3 LATERAL VIEW xml_table( "comment", comments.xml_str, struct("./@id", "./@user", "fn:count(./like)") ) t AS id, usr, likes; 12345 john 0 56789 kelly 2 54321 mike 1
Note:
You could use thexml_query_as_string
function to achieve the same result in this example. However, xml_table
is more efficient, because a single function call sets all three column values and parses the input XML only once for each row. The xml_query_as_string
function requires a separate function call for each of the three columns and reparses the same input XML value each time.This example shows how to use xml_table
to flatten nested, repeating XML elements into table rows. See Example 4 for the COMMENTS
table.
> SELECT t.i, t.u, t.l FROM comments3 LATERAL VIEW xml_table ( "let $comment := ./comment for $like in $comment/like return <r> <id>{$comment/@id/data()}</id> <user>{$comment/@user/data()}</user> <like>{$like/@user/data()}</like> </r> ", comments.xml_str, struct("./id", "./user", "./like") ) t AS i, u, l; 56789 kelly john 56789 kelly mike 54321 mike laura
This example is a slight modification of Example 5 that produces a row even when a comment has no likes. See Example 4 for the COMMENTS
table.
> SELECT t.i, t.u, t.l FROM comments3 LATERAL VIEW xml_table ( "let $comment := ./comment for $like allowing empty in $comment/like return <r> <id>{$comment/@id/data()}</id> <user>{$comment/@user/data()}</user> <like>{$like/@user/data()}</like> </r> ", comments.xml_str, struct("./id", "./user", "./like") ) t AS i, u, l; 12345 john 56789 kelly john 56789 kelly mike 54321 mike laura
You can create views and new tables using xml_table
, the same as any table-generating function. This example creates a new view named COMMENTS_LIKES
from the COMMENTS
table:
hive> CREATE VIEW comments_likes AS SELECT xml_table( "comment", comments.xml_str, struct("./@id", "count(./like)") ) AS (id, likeCt) FROM comments;
This example queries the new view:
> SELECT * FROM comments_likes WHERE CAST(likeCt AS INT) != 0; 56789 2 54321 1
You can access XML documents and text files added to the distributed cache by using the fn:doc
and fn:unparsed-text
functions.
This example queries a file named test.xml that contains this string:
<x><y>hello</y><z/><y>world</y></x>
hive> ADD FILE test.xml; Added resource: test.xml hive> SELECT xml_table("fn:doc('test.xml')/x/y", NULL) AS y FROM src; . . . hello world