xml_table

A user-defined table-generating function (UDTF) that maps an XML value to zero or more table rows. This function enables nested repeating elements in XML to be mapped to Hive table rows.

Signature

xml_table( 
   STRUCT? namespaces,
   STRING query, 
   {STRUCT | STRING} bindings,
   STRUCT? columns
)

Description

namespaces

Identifies the namespaces that the query and column expressions can use. Optional.

The value is a STRUCT with an even number of STRING fields. Each pair of fields defines a namespace binding (prefix, URI) that can be used by the query or the column expressions. See Example 3.

query

An XQuery or XPath expression that generates a table row for each returned value. It must be a constant value, because it is only read the first time the function is evaluated. The initial query string is compiled and reused in all subsequent calls.

If a dynamic error occurs during query processing, then the function does not raise an error, but logs it the first time. Subsequent dynamic errors are not logged.

bindings

The input that the query processes. The value can be an XML STRING or a STRUCT of variable values:

STRING: The string is bound to the initial context item of the query as XML. See Example 1.
STRUCT: A STRUCT with an even number of fields. Each pair of fields defines a variable binding (name, value) for the query. The name fields must be type STRING, and the value fields can be any supported primitive. See "Data Type Conversions."

columns

The XQuery or XPath expressions that define the columns of the generated rows. Optional.

The value is a STRUCT that contains the additional XQuery expressions. The XQuery expressions must be constant STRING values, because they are only read the first time the function is evaluated. For each column expression in the STRUCT, there is one column in the table.

For each item returned by the query, the column expressions are evaluated with the current item as the initial context item of the expression. The results of the column expressions are converted to STRING values and become the values of the row.

If the result of a column expression is empty or if a dynamic error occurs while evaluating the column expression, then the corresponding column value is NULL. If a column expression returns more than one item, then all but the first are ignored.

Omitting the columns argument is the same as specifying 'struct(".")'. See Example 2.

Return Value

One table row for each item returned by the query argument.

Notes

The XML table adapter enables Hive tables to be created over large XML files in HDFS. See "Hive CREATE TABLE Syntax for XML Tables".

Examples

Example 1 Using a STRING Binding

The query "x/y" returns two <y> elements, therefore two table rows are generated. Because there are two column expressions ("./z", "./w"), each row has two columns.

hive> SELECT xml_table(
   "x/y",
   "<x>
      <y>
         <z>a</z>
         <w>b</w>
      </y>
      <y>
         <z>c</z>
      </y>
   </x>
   ",
   struct("./z", "./w")
 ) AS (z, w)
 FROM src;
     .
     .
     .
a        b
c        NULL

Example 2 Using the Columns Argument

The following two queries are equivalent. The first query explicitly specifies the value of the columns argument:

hive> SELECT xml_table(
      "x/y",
      "<x><y>hello</y><y>world</y></x>",
      struct(".")
   ) AS (y)
   FROM src;
     .
     .
     .
hello
world

The second query omits the columns argument, which defaults to struct("."):

hive> SELECT xml_table(
      "x/y",
      "<x><y>hello</y><y>world</y></x>"
   ) AS (y)
   FROM src;
     .
     .
     .
hello
world

Example 3 Using the Namespaces Argument

This example specifies the optional namespaces argument, which identifies an ns prefix and a URI of http://example.org.

hive> SELECT xml_table(
     struct("ns", "http://example.org"),
     "ns:x/ns:y",
     "<x xmlns='http://example.org'><y><z/></y><y><z/><z/></y></x>",
     struct("count(./ns:z)")
  ) AS (y)
  FROM src;
     .
     .
     .
1
2

Example 4 Querying a Hive Table of XML Documents

This example queries a table named COMMENTS, which has a single column named XML_STR of type STRING. It contains these three rows:

hive> SELECT xml_str FROM comments3;

<comment id="12345" user="john" text="It is raining:("/>
<comment id="56789" user="kelly" text="I won the lottery!"><like user="john"/><like user="mike"/></comment>
<comment id="54321" user="mike" text="Happy New Year!"><like user="laura"/></comment>

The following query shows how to extract the user, text, and number of likes from the COMMENTS table.

hive> SELECT t.id, t.usr, t.likes
     FROM comments3 LATERAL VIEW xml_table(
        "comment",
        comments.xml_str, 
        struct("./@id", "./@user", "fn:count(./like)")
     ) t AS id, usr, likes;

12345   john    0
56789   kelly   2
54321   mike    1

Note:

You could use the xml_query_as_string function to achieve the same result in this example. However, xml_table is more efficient, because a single function call sets all three column values and parses the input XML only once for each row. The xml_query_as_string function requires a separate function call for each of the three columns and reparses the same input XML value each time.

Example 5 Mapping Nested XML Elements to Table Rows

This example shows how to use xml_table to flatten nested, repeating XML elements into table rows. See Example 4 for the COMMENTS table.

> SELECT t.i, t.u, t.l
    FROM comments3 LATERAL VIEW xml_table (
       "let $comment := ./comment
        for $like in $comment/like
        return
           <r>
              <id>{$comment/@id/data()}</id>
              <user>{$comment/@user/data()}</user>
              <like>{$like/@user/data()}</like>
           </r>
       ",
       comments.xml_str,
       struct("./id", "./user", "./like")
    ) t AS i, u, l;

56789   kelly   john
56789   kelly   mike
54321   mike    laura

Example 6 Mapping Optional Nested XML Elements to Table Rows

This example is a slight modification of Example 5 that produces a row even when a comment has no likes. See Example 4 for the COMMENTS table.

> SELECT t.i, t.u, t.l
     FROM comments3 LATERAL VIEW xml_table (
        "let $comment := ./comment
         for $like allowing empty in $comment/like
         return 
            <r>
               <id>{$comment/@id/data()}</id>
               <user>{$comment/@user/data()}</user>
               <like>{$like/@user/data()}</like>
            </r>
        ",
        comments.xml_str, 
        struct("./id", "./user", "./like")
     ) t AS i, u, l;

12345   john
56789   kelly   john
56789   kelly   mike
54321   mike    laura

Example 7 Creating a New View

You can create views and new tables using xml_table, the same as any table-generating function. This example creates a new view named COMMENTS_LIKES from the COMMENTS table:

hive> CREATE VIEW comments_likes AS 
     SELECT xml_table(
        "comment", 
         comments.xml_str, 
         struct("./@id", "count(./like)")
     ) AS (id, likeCt)
     FROM comments;

This example queries the new view:

> SELECT * FROM comments_likes
     WHERE CAST(likeCt AS INT) != 0;

56789   2
54321   1

Example 8 Accessing the Hadoop Distributed Cache

You can access XML documents and text files added to the distributed cache by using the fn:doc and fn:unparsed-text functions.

This example queries a file named test.xml that contains this string:

<x><y>hello</y><z/><y>world</y></x>

hive> ADD FILE test.xml;
Added resource: test.xml
hive> SELECT xml_table("fn:doc('test.xml')/x/y", NULL) AS y FROM src;
     .
     .
     .
hello
world