This chapter explains how to use the XML extensions for Apache Hive provided on Oracle Big Data Appliance with Oracle XQuery for Hadoop.
Note: The features described in this chapter are available only on Oracle Big Data Appliance. You cannot use them on other Hadoop clusters.

Oracle XQuery for Hadoop provides XML processing support that enables you to do the following:
Query large XML files in HDFS as Hive tables
Query XML strings in Hive tables
Query XML file resources in the Hadoop distributed cache
Efficiently extract atomic values from XML without using expensive DOM parsing
Retrieve, generate, and transform complex XML elements
Generate multiple table rows from a single XML value
Manage missing and dirty data in XML
The XML extensions also support the following W3C standards:
XQuery 1.0
XQuery Update Facility 1.0 (transform expressions)
XPath 2.0
XML Schema 1.0
XML Namespaces
The XML extensions have two components:
XML InputFormat and SerDe for creating XML tables
XML function library
To enable the Oracle XQuery for Hadoop extensions, use the --auxpath and -i arguments when starting Hive:
$ hive --auxpath $OXH_HOME/hive/lib -i $OXH_HOME/hive/init.sql
The first time you use the extensions, you might want to verify that they are accessible. The following procedure creates a table named SRC, loads one row into it, and then calls the xml_query function.
The SRC table is needed only to fulfill a SELECT syntax requirement. It is like the DUAL table in Oracle Database, which is referenced in SELECT statements to test SQL functions.
To verify that the extensions are accessible:
Log in to an Oracle Big Data Appliance server where you plan to work.
Create a text file named src.txt that contains one line:
$ echo "XXX" > src.txt
Start the Hive command-line interface (CLI):
$ hive --auxpath $OXH_HOME/hive/lib -i $OXH_HOME/hive/init.sql
The init.sql file contains the CREATE TEMPORARY FUNCTION statements that declare the XML functions.
Create a simple table and load data from src.txt:
hive> CREATE TABLE src(dummy STRING);
hive> LOAD DATA LOCAL INPATH 'src.txt' OVERWRITE INTO TABLE src;
Call an Oracle XQuery for Hadoop function for Hive:
hive> SELECT xml_query("x/y", "<x><y>123</y><z>456</z></x>") FROM src;
If the extensions are accessible, then the query returns ["123"].
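For intuition, the extraction that xml_query performs here can be sketched with Python's standard library. This is a rough analogue only, not the Oracle XQuery for Hadoop implementation: xml_query evaluates full XQuery, while ElementTree supports just a small XPath subset.

```python
import xml.etree.ElementTree as ET

def xml_query_sketch(path, xml):
    """Rough analogue of xml_query("x/y", ...): evaluate a simple
    child path against the XML and return the matching text values."""
    root = ET.fromstring(xml)
    # The Hive function's path starts at a virtual document node, so
    # "x/y" matches <y> inside the root element <x>; relative to the
    # parsed root element, that is just "y".
    head, _, rest = path.partition("/")
    if head != root.tag:
        return []
    return [e.text for e in root.findall(rest or ".")]

print(xml_query_sketch("x/y", "<x><y>123</y><z>456</z></x>"))  # ['123']
```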
This section describes how you can use the Hive CREATE TABLE statement to create tables over large XML documents.
Hive queries over XML tables scale well, because Oracle XQuery for Hadoop splits up the XML so that the MapReduce framework can process it in parallel.
To support scalable processing and operate in the MapReduce framework, the table adapter scans for elements to use to create table rows. It parses only the elements that it identifies as being part of the table; the rest of the XML is ignored. Thus, the XML table adapter does not perform a true parse of the entire XML document, which imposes limitations on the input XML. Because of these limitations, you can create tables only over XML documents that meet the constraints listed in "XQuery Transformation Requirements." Otherwise, you may get errors or incorrect results.
The following is the basic syntax to use when creating a Hive table over XML files:
CREATE TABLE table_name (columns)
ROW FORMAT SERDE 'oracle.hadoop.xquery.hive.OXMLSerDe'
STORED AS
  INPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLInputFormat'
  OUTPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLOutputFormat'
TBLPROPERTIES(configuration)
columns: All column types in an XML table must be one of the Hive primitive types given in "About Data Type Conversions."
configuration: Any of the properties described in "CREATE TABLE Configuration Properties."
Note: Inserting into XML tables is not supported.

CREATE TABLE Configuration Properties
oxh-elements
Identifies the names of elements in the XML that map to rows in the table, as a comma-delimited list. Required; specify this property exactly once.
This example maps each element named foo in the XML to a single row in the Hive table:
"oxh-elements" = "foo"
The next example maps each element named either foo or bar in the XML to a row in the Hive table:
"oxh-elements" = "foo, bar"
oxh-column.name
Specifies how an element selected by the oxh-elements property is mapped to columns in a row. In this property name, replace name with the name of a column in the table. The value can be any XQuery expression. The initial context item of the expression (the "." variable) is bound to the selected element.
Be sure to check log files even when a query executes successfully. If a column expression returns no value or raises a dynamic error, the column value is NULL. The first time an error occurs, it is logged and query processing continues. Subsequent errors raised by the same column expression are not logged.
In the following example, the oxh-elements property specifies that each element named foo in the XML is mapped to a single row in the table. The oxh-column properties specify that the Hive table column named BAR gets the value of the child element named bar converted to STRING, and the column named ZIP gets the value of the child element named zip converted to INT.
CREATE TABLE example (bar STRING, zip INT)
ROW FORMAT SERDE 'oracle.hadoop.xquery.hive.OXMLSerDe'
STORED AS
  INPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLInputFormat'
  OUTPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLOutputFormat'
TBLPROPERTIES(
  "oxh-elements" = "foo",
  "oxh-column.bar" = "./bar",
  "oxh-column.zip" = "./zip"
)
In this modified definition of the zip column, the column receives a value of -1 if the foo element does not have a child zip element, or if the zip element contains a nonnumeric value.
"oxh-column.zip" = " if (./zip castable as xs:int) then xs:int(./zip) else -1 "
Any column of the table that does not have a corresponding oxh-column property behaves as if the following property is specified:
"oxh-column.name" = "(./name | ./@name)[1]"
Thus, the default behavior is to select the first child element or attribute that matches the table column name.
The following two table definitions are equivalent:
Table Definition 1:
CREATE TABLE example (bar STRING, zip INT)
ROW FORMAT SERDE 'oracle.hadoop.xquery.hive.OXMLSerDe'
STORED AS
  INPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLInputFormat'
  OUTPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLOutputFormat'
TBLPROPERTIES(
  "oxh-elements" = "foo",
  "oxh-column.bar" = "(./bar | ./@bar)[1]",
  "oxh-column.zip" = "(./zip | ./@zip)[1]"
)
Table Definition 2:
CREATE TABLE example (bar STRING, zip INT)
ROW FORMAT SERDE 'oracle.hadoop.xquery.hive.OXMLSerDe'
STORED AS
  INPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLInputFormat'
  OUTPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLOutputFormat'
TBLPROPERTIES(
  "oxh-elements" = "foo"
)
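The default lookup can be pictured with a small Python sketch. This is a rough analogue only: OXH evaluates the real XPath union, and in XPath document order an element's attributes precede its child elements, which is why the attribute wins when both are present.

```python
import xml.etree.ElementTree as ET

def default_column(row_el, name):
    """Sketch of the default mapping (./name | ./@name)[1]:
    in XPath document order attributes precede children, so the
    attribute is selected when both exist."""
    if name in row_el.attrib:
        return row_el.attrib[name]
    child = row_el.find(name)
    return None if child is None else child.text

foo = ET.fromstring('<foo bar="attr-value"><zip>12345</zip></foo>')
print(default_column(foo, "bar"))  # attr-value
print(default_column(foo, "zip"))  # 12345
print(default_column(foo, "baz"))  # None (column becomes NULL)
```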
oxh-namespace.prefix
Defines a namespace binding. In this property name, replace prefix with the namespace prefix to bind.
This example binds the prefix myns to the namespace http://example.org:
"oxh-namespace.myns" = "http://example.org"
You can use this property multiple times to define additional namespaces. The namespace definitions are used when parsing the XML. The oxh-elements and oxh-column property values can also reference them.
In the following example, only foo elements in the http://example.org namespace are mapped to table rows:
"oxh-namespace.myns" = "http://example.org", "oxh-elements" = "myns:foo", "oxh-column.bar" = "./myns:bar"
oxh-entity.name
Defines an entity reference definition; replace name with the name of the entity. You can use this property multiple times to define a set of entities.
In the following example, entity references in the XML are expanded from &foo; to "foo value" and from &bar; to "bar value":
"oxh-entity.foo" = "foo value" "oxh-entity.bar" = "bar value"
oxh-charset
Specifies the character encoding of the XML files. The supported encodings are UTF-8 (the default), ISO-8859-1, and US-ASCII.
This example sets the character set:
"oxh-charset" = "ISO-8859-1"
This section includes some examples that use a simple data set, and others that use a very large, complex real-world data set:
These examples show how to create Hive tables over a small XML document that contains comments posted by users of a fictitious website. Each comment element in the document has zero or more child like elements that indicate that a user liked the comment.
<comments>
  <comment id="12345" user="john" text="It is raining :( "/>
  <comment id="56789" user="kelly" text="I won the lottery!">
    <like user="john"/>
    <like user="mike"/>
  </comment>
  <comment id="54321" user="mike" text="Happy New Year!">
    <like user="laura"/>
  </comment>
</comments>
In the CREATE TABLE examples, the comments.xml input file is in the current working directory of the local file system.
This example creates a table named COMMENTS with a row for each comment, containing the user name, the comment text, and the number of likes.
hive> CREATE TABLE comments (usr STRING, content STRING, likeCt INT)
ROW FORMAT SERDE 'oracle.hadoop.xquery.hive.OXMLSerDe'
STORED AS
  INPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLInputFormat'
  OUTPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLOutputFormat'
TBLPROPERTIES(
  "oxh-elements" = "comment",
  "oxh-column.usr" = "./@user",
  "oxh-column.content" = "./@text",
  "oxh-column.likeCt" = "fn:count(./like)"
);
The Hive LOAD DATA command loads comments.xml into the COMMENTS table. See "Simple Examples" for the contents of the file.
hive> LOAD DATA LOCAL INPATH 'comments.xml' OVERWRITE INTO TABLE comments;
The following query shows the content of the COMMENTS table:
hive> SELECT usr, content, likeCt FROM comments;
john It is raining :( 0
kelly I won the lottery! 2
mike Happy New Year! 1
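The same three rows can be reproduced outside Hive with a short Python sketch over the comments.xml content. This is illustrative only; in the real table the column expressions are evaluated by the XQuery engine.

```python
import xml.etree.ElementTree as ET

COMMENTS_XML = """\
<comments>
  <comment id="12345" user="john" text="It is raining :( "/>
  <comment id="56789" user="kelly" text="I won the lottery!">
    <like user="john"/>
    <like user="mike"/>
  </comment>
  <comment id="54321" user="mike" text="Happy New Year!">
    <like user="laura"/>
  </comment>
</comments>
"""

root = ET.fromstring(COMMENTS_XML)
# One row per comment: @user, @text, and fn:count(./like).
rows = [(c.get("user"), c.get("text"), len(c.findall("like")))
        for c in root.findall("comment")]
for usr, content, like_ct in rows:
    print(usr, content, like_ct)
```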
This example is like Example 1, except that the like elements are produced as XML in a STRING column.
hive> CREATE TABLE comments2 (usr STRING, content STRING, likes STRING)
ROW FORMAT SERDE 'oracle.hadoop.xquery.hive.OXMLSerDe'
STORED AS
  INPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLInputFormat'
  OUTPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLOutputFormat'
TBLPROPERTIES(
  "oxh-elements" = "comment",
  "oxh-column.usr" = "./@user",
  "oxh-column.content" = "./@text",
  "oxh-column.likes" = "fn:serialize(<likes>{./like}</likes>)"
);
The Hive LOAD DATA command loads comments.xml into the table. See "Simple Examples" for the contents of the file.
hive> LOAD DATA LOCAL INPATH 'comments.xml' OVERWRITE INTO TABLE comments2;
The following query shows the content of the COMMENTS2 table:
hive> SELECT usr, content, likes FROM comments2;
john It is raining :( <likes/>
kelly I won the lottery! <likes><like user="john"/><like user="mike"/></likes>
mike Happy New Year! <likes><like user="laura"/></likes>
The next query extracts the user names from the like elements:
hive> SELECT usr, t.user FROM comments2 LATERAL VIEW xml_table("likes/like", comments2.likes, struct("./@user")) t AS user;
kelly john
kelly mike
mike laura
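Conceptually, the lateral view re-parses each likes string and emits one output row per like element, carrying along the outer row's columns. Sketched in Python (an analogue only, not how OXH is implemented):

```python
import xml.etree.ElementTree as ET

# (usr, likes) pairs as they would appear in the COMMENTS2 table.
table = [
    ("john",  "<likes/>"),
    ("kelly", '<likes><like user="john"/><like user="mike"/></likes>'),
    ("mike",  '<likes><like user="laura"/></likes>'),
]

# Analogue of LATERAL VIEW xml_table("likes/like", ..., struct("./@user")):
# one output row per <like>, paired with the outer row's usr value.
rows = [(usr, like.get("user"))
        for usr, likes_xml in table
        for like in ET.fromstring(likes_xml).findall("like")]
print(rows)  # [('kelly', 'john'), ('kelly', 'mike'), ('mike', 'laura')]
```

Note that john produces no output rows, because his likes value contains no like elements, matching the query result above.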
This example creates a table named COMMENTS3 with a row for each comment, and produces the XML in a single STRING column.
hive> CREATE TABLE comments3 (xml STRING)
ROW FORMAT SERDE 'oracle.hadoop.xquery.hive.OXMLSerDe'
STORED AS
  INPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLInputFormat'
  OUTPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLOutputFormat'
TBLPROPERTIES(
  "oxh-elements" = "comment",
  "oxh-column.xml" = "fn:serialize(.)"
);
The Hive LOAD DATA command loads comments.xml into the table. See "Simple Examples" for the contents of the file.
hive> LOAD DATA LOCAL INPATH 'comments.xml' OVERWRITE INTO TABLE comments3;
The following query shows the contents of the XML column:
hive> SELECT xml FROM comments3;
<comment id="12345" user="john" text="It is raining :( "/>
<comment id="56789" user="kelly" text="I won the lottery!">
<like user="john"/>
<like user="mike"/>
</comment>
<comment id="54321" user="mike" text="Happy New Year!">
<like user="laura"/>
</comment>
The next query extracts the IDs and converts them to integers:
hive> SELECT xml_query_as_int("comment/@id", xml) FROM comments3;
12345
56789
54321
These examples use data from OpenStreetMap, which provides free map data for the entire world. You can export the data as XML for specific geographic regions or for the entire planet. An OpenStreetMap XML document mainly contains a sequence of node, way, and relation elements.
In these examples, the OpenStreetMap XML files are stored in the /user/name/osm HDFS directory.
See Also:
To download OpenStreetMap data:
For information about the OpenStreetMap XML format:
This example creates a table over OpenStreetMap XML with one row for each node element, as follows:
The id, lat, and lon attributes of the node element are mapped to the ID, LATITUDE, and LONGITUDE columns.
The year is extracted from the timestamp attribute and mapped to the YEAR column. If a node does not have a timestamp attribute, then -1 is used for the year.
If the node element has any child tag elements, then they are stored as an XML string in the TAGS column. If node has no child tag elements, then the column value is NULL.
hive> CREATE EXTERNAL TABLE nodes (
  id BIGINT,
  latitude DOUBLE,
  longitude DOUBLE,
  year SMALLINT,
  tags STRING
)
ROW FORMAT SERDE 'oracle.hadoop.xquery.hive.OXMLSerDe'
STORED AS
  INPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLInputFormat'
  OUTPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLOutputFormat'
LOCATION '/user/name/osm'
TBLPROPERTIES (
  "oxh-elements" = "node",
  "oxh-column.id" = "./@id",
  "oxh-column.latitude" = "./@lat",
  "oxh-column.longitude" = "./@lon",
  "oxh-column.year" = "
    if (fn:exists(./@timestamp)) then
      fn:year-from-dateTime(xs:dateTime(./@timestamp))
    else
      -1
  ",
  "oxh-column.tags" = "
    if (fn:exists(./tag)) then
      fn:serialize(<tags>{./tag}</tags>)
    else
      ()
  "
);
The following query returns the number of nodes per year:
hive> SELECT year, count(*) FROM nodes GROUP BY year;
This query returns the total number of tags across nodes:
hive> SELECT sum(xml_query_as_int("count(tags/tag)", tags)) FROM nodes;
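The YEAR column's conditional can be pictured in Python. This is a sketch of the fallback logic only; OXH evaluates the real expression as XQuery with xs:dateTime semantics, whereas the sketch simply reads the leading four digits of the timestamp.

```python
import xml.etree.ElementTree as ET

def node_year(node_el):
    """Mimic the year column: the year of xs:dateTime(@timestamp)
    when the attribute exists, else -1."""
    ts = node_el.get("timestamp")
    if ts is None:
        return -1
    # OpenStreetMap timestamps look like 2013-02-05T12:34:56Z;
    # the year is the leading four digits.
    return int(ts[:4])

print(node_year(ET.fromstring('<node id="1" timestamp="2013-02-05T12:34:56Z"/>')))  # 2013
print(node_year(ET.fromstring('<node id="2"/>')))  # -1
```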
In OpenStreetMap XML, the node, way, and relation elements share a set of common attributes, such as the user who contributed the data. The next table produces one row for each node, way, and relation element.
See Also:
For a description of the OpenStreetMap elements and attributes:

hive> CREATE EXTERNAL TABLE osm (
  id BIGINT,
  uid BIGINT,
  type STRING
)
ROW FORMAT SERDE 'oracle.hadoop.xquery.hive.OXMLSerDe'
STORED AS
  INPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLInputFormat'
  OUTPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLOutputFormat'
LOCATION '/user/name/osm'
TBLPROPERTIES (
  "oxh-elements" = "node, way, relation",
  "oxh-column.id" = "./@id",
  "oxh-column.uid" = "./@uid",
  "oxh-column.type" = "./name()"
);
The following query returns the number of node, way, and relation elements. The TYPE column is set to the name of the selected element, which is either node, way, or relation.
hive> SELECT type, count(*) FROM osm GROUP BY type;
This query returns the number of distinct user IDs:
hive> SELECT count(*) FROM ( SELECT uid FROM osm GROUP BY uid ) t;