This chapter explains how to use the XML extensions for Apache Hive provided on Oracle Big Data Appliance with Oracle XQuery for Hadoop. The chapter contains the following sections:
Note: The features described in this chapter are available only on Oracle Big Data Appliance. They are not included in the downloadable Oracle XQuery for Hadoop installation package.

The XML Extensions for Hive provide XML processing support that enables you to do the following:
Query large XML files in HDFS as Hive tables
Query XML strings in Hive tables
Query XML file resources in the Hadoop distributed cache
Efficiently extract atomic values from XML without using expensive DOM parsing
Retrieve, generate, and transform complex XML elements
Generate multiple table rows from a single XML value
Manage missing and dirty data in XML
The XML extensions also support these W3C standards:
XQuery 1.0
XQuery Update Facility 1.0 (transform expressions)
XPath 2.0
XML Schema 1.0
XML Namespaces
The XML extensions have two components:
XML InputFormat and SerDe for creating XML tables
XML function library
To enable the Oracle XQuery for Hadoop extensions, use the --auxpath and -i arguments when starting Hive:
$ hive --auxpath $OXH_HOME/hive/lib -i $OXH_HOME/hive/init.sql
The first time you use the extensions, verify that they are accessible. The following procedure creates a table named SRC, loads one row into it, and calls the xml_query function.
To verify that the extensions are accessible:
Log in to an Oracle Big Data Appliance server where you plan to work.
Create a text file named src.txt that contains one line:
$ echo "XXX" > src.txt
Start the Hive command-line interface (CLI):
$ hive --auxpath $OXH_HOME/hive/lib -i $OXH_HOME/hive/init.sql
The init.sql file contains the CREATE TEMPORARY FUNCTION statements that declare the XML functions.
Create a simple table:
hive> CREATE TABLE src(dummy STRING);
The SRC table is needed only to fulfill a SELECT syntax requirement. It is like the DUAL table in Oracle Database, which is referenced in SELECT statements to test SQL functions.
Load data from src.txt into the table:
hive> LOAD DATA LOCAL INPATH 'src.txt' OVERWRITE INTO TABLE src;
Query the table using a Hive SELECT statement:
hive> SELECT * FROM src;
OK
XXX
Call an Oracle XQuery for Hadoop function for Hive. This example calls the xml_query function to parse an XML string:
hive> SELECT xml_query("x/y", "<x><y>123</y><z>456</z></x>") FROM src;
.
.
.
["123"]
If the extensions are accessible, then the query returns ["123"], as shown in the example.
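Outside Hive, the behavior of xml_query on this input can be sketched with the Python standard library. This is only an illustration of the semantics, not part of the extensions; the helper name and its simplified child-path handling are assumptions:

```python
import xml.etree.ElementTree as ET

def xml_query_sketch(path, xml_string):
    """Rough stand-in for xml_query: evaluate a simple child-step path
    over the document and return matching text values as strings."""
    root = ET.fromstring(xml_string)      # document root, e.g. <x>
    steps = path.split("/")               # "x/y" -> ["x", "y"]
    if steps and steps[0] == root.tag:    # first step names the root element
        steps = steps[1:]
    if not steps:                         # path selected the root itself
        return [root.text or ""]
    return [el.text or "" for el in root.findall("/".join(steps))]

print(xml_query_sketch("x/y", "<x><y>123</y><z>456</z></x>"))  # ['123']
```

The query "x/y" selects the y children of the x root, so only the text "123" is returned; the z element is not selected.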
The Oracle XQuery for Hadoop extensions enable you to query XML strings in Hive tables and XML file resources in the Hadoop distributed cache. These are the functions:
xml_query: Returns the result of a query as an array of STRING values.
xml_query_as_primitive: Returns the result of a query as a Hive primitive value. Each Hive primitive data type has a separate function named for it.
xml_exists: Tests if the result of a query is empty.
xml_table: Maps an XML value to zero or more table rows, and enables nested, repeating elements in XML to be mapped to Hive table rows.
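The row-generating behavior of xml_table can be approximated outside Hive. The following Python sketch (helper name and simplified path handling are assumptions; standard library only) selects one element per row and evaluates attribute or child-element column paths against it:

```python
import xml.etree.ElementTree as ET

def xml_table_sketch(row_path, xml_string, column_paths):
    """Rough stand-in for xml_table: select one element per output row,
    then evaluate each column path relative to that element.
    Only child steps and ./@attr column paths are handled here."""
    root = ET.fromstring(xml_string)
    steps = row_path.split("/")
    if steps and steps[0] == root.tag:    # first step names the root element
        steps = steps[1:]
    row_elements = root.findall("/".join(steps)) if steps else [root]
    rows = []
    for el in row_elements:
        row = []
        for col in column_paths:
            if col.startswith("./@"):     # attribute column
                row.append(el.get(col[3:]))
            else:                         # child-element text column
                child = el.find(col[2:] if col.startswith("./") else col)
                row.append(child.text if child is not None else None)
        rows.append(tuple(row))
    return rows

print(xml_table_sketch("likes/like",
                       '<likes><like user="john"/><like user="mike"/></likes>',
                       ["./@user"]))  # [('john',), ('mike',)]
```

Each matching like element yields one output row, which is how xml_table lets nested repeating elements become separate Hive rows.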
This section describes how you can use the Hive CREATE TABLE statement to create tables over large XML documents.
Hive queries over XML tables scale well, because Oracle XQuery for Hadoop splits up the XML so that the MapReduce framework can process it in parallel.
To support scalable processing and operate in the MapReduce framework, the table adapter scans for elements to use to create table rows. It parses only the elements that it identifies as being part of the table; the rest of the XML is ignored. Thus, the XML table adapter does not perform a true parse of the entire XML document, which imposes limitations on the input XML. Because of these limitations, you can create tables only over XML documents that meet the constraints listed in "XQuery Transformation Requirements." Otherwise, you might get errors or incorrect results.
The following is the basic syntax of the Hive CREATE TABLE statement for creating a Hive table over XML files:

CREATE TABLE table_name (columns)
ROW FORMAT SERDE 'oracle.hadoop.xquery.hive.OXMLSerDe'
STORED AS
  INPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLInputFormat'
  OUTPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLOutputFormat'
TBLPROPERTIES(configuration)
Parameters
columns: All column types in an XML table must be one of the Hive primitive types given in "Data Type Conversions."
configuration: Any of the properties described in "CREATE TABLE Configuration Properties." Separate multiple properties with commas.
Note: Inserting data into XML tables is not supported.

Use these configuration properties in the configuration parameter of the CREATE TABLE command.
oxh-elements: Identifies the names of elements in the XML that map to rows in the table, in a comma-delimited list. This property is required and must be specified one time.
This example maps each element named foo in the XML to a single row in the Hive table:
"oxh-elements" = "foo"
The next example maps each element named either foo or bar in the XML to a row in the Hive table:
"oxh-elements" = "foo, bar"
oxh-column.name: Specifies how an element selected by the oxh-elements property is mapped to columns in a row. In this property name, replace name with the name of a column in the table. The value can be any XQuery expression. The initial context item of the expression (the "." variable) is bound to the selected element.
Check the log files even when a query executes successfully. If a column expression returns no value or raises a dynamic error, the column value is NULL. The first time an error occurs, it is logged and query processing continues. Subsequent errors raised by the same column expression are not logged.
Any column of the table that does not have a corresponding oxh-column property behaves as if the following property is specified:
"oxh-column.name" = "(./name | ./@name)[1]"
Thus, the default behavior is to select the first child element or attribute that matches the table column name. See "Syntax Example."
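As a rough illustration of this default mapping (a sketch, not the extensions' implementation; it checks the attribute before the child rather than computing true XPath document order):

```python
import xml.etree.ElementTree as ET

def default_column_value(element, name):
    """Approximate "(./name | ./@name)[1]": return a matching attribute
    or child element of the row element. The real expression takes the
    first matching node in document order; this sketch simply checks
    the attribute first, then the child element."""
    if name in element.attrib:
        return element.attrib[name]
    child = element.find(name)
    return child.text if child is not None else None

row = ET.fromstring('<foo bar="7"><zip>95110</zip></foo>')
print(default_column_value(row, "bar"))  # '7'     (attribute)
print(default_column_value(row, "zip"))  # '95110' (child element)
```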
oxh-namespace.prefix: Defines a namespace binding for the prefix given in the property name.
This example binds the prefix myns to the namespace http://example.org:
"oxh-namespace.myns" = "http://example.org"
You can use this property multiple times to define additional namespaces. The namespace definitions are used when parsing the XML. The oxh-elements and oxh-column property values can also reference them.
In the following example, only foo elements in the http://example.org namespace are mapped to table rows:
"oxh-namespace.myns" = "http://example.org",
"oxh-elements" = "myns:foo",
"oxh-column.bar" = "./myns:bar"
oxh-entity.name: Defines a set of entity reference definitions; the entity named name expands to the property value.
In the following example, entity references in the XML are expanded from &foo; to "foo value" and from &bar; to "bar value":
"oxh-entity.foo" = "foo value",
"oxh-entity.bar" = "bar value"
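Conceptually, these properties act like a substitution table applied as the XML is read. A minimal Python sketch follows; the expansion function is a hypothetical illustration, not how the SerDe is implemented:

```python
import xml.etree.ElementTree as ET

# Substitution table corresponding to the oxh-entity.* properties above.
entities = {"foo": "foo value", "bar": "bar value"}

def expand_entities(xml_text, defs):
    """Hypothetical illustration: expand &name; references before parsing.
    (The SerDe performs the expansion internally while reading the XML.)"""
    for name, value in defs.items():
        xml_text = xml_text.replace("&%s;" % name, value)
    return xml_text

doc = expand_entities("<msg>&foo; and &bar;</msg>", entities)
print(ET.fromstring(doc).text)  # 'foo value and bar value'
```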
oxh-charset: Specifies the character encoding of the XML files. The supported encodings are UTF-8 (default), ISO-8859-1, and US-ASCII.
All XML files for the table must share the same character encoding. Any encoding declarations in the XML files are ignored.
This example defines the character set:
"oxh-charset" = "ISO-8859-1"
This section includes the following examples:
This example shows how to map XML elements to column names.
In the following table definition, the oxh-elements property specifies that each element named foo in the XML is mapped to a single row in the table. The oxh-column properties specify that a Hive table column named BAR gets the value of the child element named bar converted to STRING, and the column named ZIP gets the value of the child element named zip converted to INT.
CREATE TABLE example (bar STRING, zip INT)
ROW FORMAT SERDE 'oracle.hadoop.xquery.hive.OXMLSerDe'
STORED AS
  INPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLInputFormat'
  OUTPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLOutputFormat'
TBLPROPERTIES(
  "oxh-elements" = "foo",
  "oxh-column.bar" = "./bar",
  "oxh-column.zip" = "./zip"
)
In this modified definition of the ZIP column, the column receives a value of -1 if the foo element does not have a child zip element, or if the zip element contains a nonnumeric value:
"oxh-column.zip" = "
  if (./zip castable as xs:int) then
    xs:int(./zip)
  else
    -1
"
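The same guard-with-default pattern, sketched in Python (the helper name is an assumption):

```python
def zip_or_default(zip_text, default=-1):
    """Mirror the XQuery guard: keep the value only if it casts to int."""
    try:
        return int(zip_text)
    except (TypeError, ValueError):   # missing element or nonnumeric text
        return default

print(zip_or_default("95110"))  # 95110
print(zip_or_default(None))     # -1  (no child zip element)
print(zip_or_default("N/A"))    # -1  (nonnumeric value)
```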
The following two table definitions are equivalent. Table Definition 2 relies on the default mappings for the BAR and ZIP columns.
Table Definition 1:

CREATE TABLE example (bar STRING, zip INT)
ROW FORMAT SERDE 'oracle.hadoop.xquery.hive.OXMLSerDe'
STORED AS
  INPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLInputFormat'
  OUTPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLOutputFormat'
TBLPROPERTIES(
  "oxh-elements" = "foo",
  "oxh-column.bar" = "(./bar | ./@bar)[1]",
  "oxh-column.zip" = "(./zip | ./@zip)[1]"
)
Table Definition 2:

CREATE TABLE example (bar STRING, zip INT)
ROW FORMAT SERDE 'oracle.hadoop.xquery.hive.OXMLSerDe'
STORED AS
  INPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLInputFormat'
  OUTPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLOutputFormat'
TBLPROPERTIES(
  "oxh-elements" = "foo"
)
These examples show how to create Hive tables over a small XML document that contains comments posted by users of a fictitious website. Each comment element in the document can have zero or more like elements that indicate users who liked the comment.
<comments>
  <comment id="12345" user="john" text="It is raining :( "/>
  <comment id="56789" user="kelly" text="I won the lottery!">
    <like user="john"/>
    <like user="mike"/>
  </comment>
  <comment id="54321" user="mike" text="Happy New Year!">
    <like user="laura"/>
  </comment>
</comments>
In the CREATE TABLE examples, the comments.xml input file is in the current working directory of the local file system.
The following Hive CREATE TABLE command creates a table named COMMENTS with a row for each comment, containing the user name, text, and number of likes:
hive> CREATE TABLE comments (usr STRING, content STRING, likeCt INT)
    > ROW FORMAT SERDE 'oracle.hadoop.xquery.hive.OXMLSerDe'
    > STORED AS
    >   INPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLInputFormat'
    >   OUTPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLOutputFormat'
    > TBLPROPERTIES(
    >   "oxh-elements" = "comment",
    >   "oxh-column.usr" = "./@user",
    >   "oxh-column.content" = "./@text",
    >   "oxh-column.likeCt" = "fn:count(./like)"
    > );
The Hive LOAD DATA command loads comments.xml into the COMMENTS table. See "Simple Examples" for the contents of the file.
hive> LOAD DATA LOCAL INPATH 'comments.xml' OVERWRITE INTO TABLE comments;
The following query shows the content of the COMMENTS table:
hive> SELECT usr, content, likeCt FROM comments;
.
.
.
john It is raining :( 0
kelly I won the lottery! 2
mike Happy New Year! 1
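The same rows can be reproduced outside Hive with a short Python sketch over the sample document (standard library only; this mirrors the column expressions ./@user, ./@text, and fn:count(./like)):

```python
import xml.etree.ElementTree as ET

comments_xml = """<comments>
  <comment id="12345" user="john" text="It is raining :( "/>
  <comment id="56789" user="kelly" text="I won the lottery!">
    <like user="john"/><like user="mike"/>
  </comment>
  <comment id="54321" user="mike" text="Happy New Year!">
    <like user="laura"/>
  </comment>
</comments>"""

# One tuple per comment element: ./@user, ./@text, fn:count(./like)
rows = [(c.get("user"), c.get("text"), len(c.findall("like")))
        for c in ET.fromstring(comments_xml).findall("comment")]

for usr, content, like_ct in rows:
    print(usr, content, like_ct)
```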
This CREATE TABLE command is like Example 1, except that the like elements are produced as XML in a STRING column.
hive> CREATE TABLE comments2 (usr STRING, content STRING, likes STRING)
    > ROW FORMAT SERDE 'oracle.hadoop.xquery.hive.OXMLSerDe'
    > STORED AS
    >   INPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLInputFormat'
    >   OUTPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLOutputFormat'
    > TBLPROPERTIES(
    >   "oxh-elements" = "comment",
    >   "oxh-column.usr" = "./@user",
    >   "oxh-column.content" = "./@text",
    >   "oxh-column.likes" = "fn:serialize(<likes>{./like}</likes>)"
    > );
The Hive LOAD DATA command loads comments.xml into the table. See "Simple Examples" for the contents of the file.
hive> LOAD DATA LOCAL INPATH 'comments.xml' OVERWRITE INTO TABLE comments2;
The following query shows the content of the COMMENTS2 table:
hive> SELECT usr, content, likes FROM comments2;
.
.
.
john It is raining :( <likes/>
kelly I won the lottery! <likes><like user="john"/><like user="mike"/></likes>
mike Happy New Year! <likes><like user="laura"/></likes>
The next query extracts the user names from the like elements:
hive> SELECT usr, t.user FROM comments2 LATERAL VIEW
    > xml_table("likes/like", comments2.likes, struct("./@user")) t AS user;
.
.
.
kelly john
kelly mike
mike laura
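What the LATERAL VIEW does here, producing one output row per like element inside each likes string, can be sketched in Python (standard library; the variable names are illustrative):

```python
import xml.etree.ElementTree as ET

# (usr, likes) pairs as they would appear in the COMMENTS2 table
comments2 = [
    ("john",  "<likes/>"),
    ("kelly", '<likes><like user="john"/><like user="mike"/></likes>'),
    ("mike",  '<likes><like user="laura"/></likes>'),
]

pairs = []
for usr, likes in comments2:
    # one output row per like element, as in the LATERAL VIEW
    for like in ET.fromstring(likes).findall("like"):
        pairs.append((usr, like.get("user")))

print(pairs)  # [('kelly', 'john'), ('kelly', 'mike'), ('mike', 'laura')]
```

John's empty likes element produces no rows, which matches the query output above.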
This command creates a table named COMMENTS3 with a row for each comment, and produces the XML in a single STRING column.
hive> CREATE TABLE comments3 (xml STRING)
    > ROW FORMAT SERDE 'oracle.hadoop.xquery.hive.OXMLSerDe'
    > STORED AS
    >   INPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLInputFormat'
    >   OUTPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLOutputFormat'
    > TBLPROPERTIES(
    >   "oxh-elements" = "comment",
    >   "oxh-column.xml" = "fn:serialize(.)"
    > );
The Hive LOAD DATA command loads comments.xml into the table. See "Simple Examples" for the contents of the file.
hive> LOAD DATA LOCAL INPATH 'comments.xml' OVERWRITE INTO TABLE comments3;
The following query shows the contents of the XML column:
hive> SELECT xml FROM comments3;
.
.
.
<comment id="12345" user="john" text="It is raining :( "/>
<comment id="56789" user="kelly" text="I won the lottery!">
<like user="john"/>
<like user="mike"/>
</comment>
<comment id="54321" user="mike" text="Happy New Year!">
<like user="laura"/>
</comment>
The next query extracts the IDs and converts them to integers:
hive> SELECT xml_query_as_int("comment/@id", xml) FROM comments3;
.
.
.
12345
56789
54321
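The cast performed by xml_query_as_int corresponds to parsing each stored XML string and converting the id attribute to an integer, as this Python sketch shows:

```python
import xml.etree.ElementTree as ET

# Serialized comment elements, as stored in the COMMENTS3 xml column.
xml_rows = [
    '<comment id="12345" user="john" text="It is raining :( "/>',
    '<comment id="56789" user="kelly" text="I won the lottery!">'
    '<like user="john"/><like user="mike"/></comment>',
    '<comment id="54321" user="mike" text="Happy New Year!">'
    '<like user="laura"/></comment>',
]

# comment/@id, cast to int, as in xml_query_as_int
ids = [int(ET.fromstring(x).get("id")) for x in xml_rows]
print(ids)  # [12345, 56789, 54321]
```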
These examples use data from OpenStreetMap, which provides free map data for the entire world. You can export the data as XML for specific geographic regions or for the entire planet. An OpenStreetMap XML document mainly contains a sequence of node, way, and relation elements.
In these examples, the OpenStreetMap XML files are stored in the /user/name/osm HDFS directory.
See Also:
To download OpenStreetMap data:
For information about the OpenStreetMap XML format:
This example creates a table over OpenStreetMap XML with one row for each node element, as follows:
The id, lat, and lon attributes of the node element are mapped to table columns.
The year is extracted from the timestamp attribute and mapped to the YEAR column. If a node does not have a timestamp attribute, then -1 is used for the year.
If the node element has any child tag elements, then they are stored as an XML string in the TAGS column. If the node has no child tag elements, then the column value is NULL.
hive> CREATE EXTERNAL TABLE nodes (
    >   id BIGINT,
    >   latitude DOUBLE,
    >   longitude DOUBLE,
    >   year SMALLINT,
    >   tags STRING
    > )
    > ROW FORMAT SERDE 'oracle.hadoop.xquery.hive.OXMLSerDe'
    > STORED AS
    >   INPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLInputFormat'
    >   OUTPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLOutputFormat'
    > LOCATION '/user/name/osm'
    > TBLPROPERTIES (
    >   "oxh-elements" = "node",
    >   "oxh-column.id" = "./@id",
    >   "oxh-column.latitude" = "./@lat",
    >   "oxh-column.longitude" = "./@lon",
    >   "oxh-column.year" = "
    >     if (fn:exists(./@timestamp)) then
    >       fn:year-from-dateTime(xs:dateTime(./@timestamp))
    >     else
    >       -1
    >   ",
    >   "oxh-column.tags" = "
    >     if (fn:exists(./tag)) then
    >       fn:serialize(<tags>{./tag}</tags>)
    >     else
    >       ()
    >   "
    > );
The following query returns the number of nodes per year:
hive> SELECT year, count(*) FROM nodes GROUP BY year;
This query returns the total number of tags across nodes:
hive> SELECT sum(xml_query_as_int("count(tags/tag)", tags)) FROM nodes;
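The year and tags column expressions in this table can be mimicked for a single sample node element in Python (the sample data and helper name are assumptions; the real table reads OpenStreetMap files from HDFS):

```python
import xml.etree.ElementTree as ET

def node_row(node):
    """Evaluate the year and tags column expressions for one node element."""
    ts = node.get("timestamp")
    year = int(ts[:4]) if ts else -1      # year component of xs:dateTime
    tags = node.findall("tag")
    if tags:                              # fn:serialize(<tags>{./tag}</tags>)
        tag_xml = "<tags>%s</tags>" % "".join(
            ET.tostring(t, encoding="unicode") for t in tags)
    else:
        tag_xml = None                    # empty sequence -> NULL column
    return year, tag_xml

n = ET.fromstring('<node id="1" lat="37.3" lon="-121.9" '
                  'timestamp="2013-06-05T12:00:00Z"><tag k="name" v="A"/></node>')
print(node_row(n))
```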
In OpenStreetMap XML, the node, way, and relation elements share a set of common attributes, such as the user who contributed the data. The next table produces one row for each node, way, and relation element.
See Also: For a description of the OpenStreetMap elements and attributes:

hive> CREATE EXTERNAL TABLE osm (
    >   id BIGINT,
    >   uid BIGINT,
    >   type STRING
    > )
    > ROW FORMAT SERDE 'oracle.hadoop.xquery.hive.OXMLSerDe'
    > STORED AS
    >   INPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLInputFormat'
    >   OUTPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLOutputFormat'
    > LOCATION '/user/name/osm'
    > TBLPROPERTIES (
    >   "oxh-elements" = "node, way, relation",
    >   "oxh-column.id" = "./@id",
    >   "oxh-column.uid" = "./@uid",
    >   "oxh-column.type" = "./name()"
    > );
The following query returns the number of node, way, and relation elements. The TYPE column is set to the name of the selected element, which is either node, way, or relation.
hive> SELECT type, count(*) FROM osm GROUP BY type;
This query returns the number of distinct user IDs:
hive> SELECT count(*) FROM (SELECT uid FROM osm GROUP BY uid) t;
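Both aggregations, counting rows per element type via ./name() and counting distinct contributor IDs, can be sketched over a small in-memory sample in Python (the sample document is an assumption):

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Tiny in-memory stand-in for the OpenStreetMap files in /user/name/osm
osm_xml = """<osm>
  <node id="1" uid="10"/><node id="2" uid="10"/>
  <way id="3" uid="11"/>
  <relation id="4" uid="10"/>
</osm>"""

root = ET.fromstring(osm_xml)
type_counts = Counter(el.tag for el in root)         # ./name() per row, grouped
distinct_uids = len({el.get("uid") for el in root})  # distinct uid values
print(dict(type_counts), distinct_uids)
```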