This chapter explains how to use the XML extensions for Apache Hive provided on Oracle Big Data Appliance with Oracle XQuery for Hadoop. The chapter contains the following sections:
Note:
The features described in this chapter are available only on Oracle Big Data Appliance. They are not included in the downloadable Oracle XQuery for Hadoop installation package.

The XML Extensions for Hive provide XML processing support that enables you to do the following:
Query large XML files in HDFS as Hive tables
Query XML strings in Hive tables
Query XML file resources in the Hadoop distributed cache
Efficiently extract atomic values from XML without using expensive DOM parsing
Retrieve, generate, and transform complex XML elements
Generate multiple table rows from a single XML value
Manage missing and dirty data in XML
The XML extensions also support these W3C standards:
XQuery 1.0
XQuery Update Facility 1.0 (transform expressions)
XPath 2.0
XML Schema 1.0
XML Namespaces
The XML extensions have two components:
XML InputFormat and SerDe for creating XML tables
XML function library
To enable the Oracle XQuery for Hadoop extensions, use the --auxpath and -i arguments when starting Hive:
$ hive --auxpath $OXH_HOME/hive/lib -i $OXH_HOME/hive/init.sql
The first time you use the extensions, verify that they are accessible. The following procedure creates a table named SRC, loads one row into it, and calls the xml_query function.
To verify that the extensions are accessible:
Log in to an Oracle Big Data Appliance server where you plan to work.
Create a text file named src.txt that contains one line:
$ echo "XXX" > src.txt
Start the Hive command-line interface (CLI):
$ hive --auxpath $OXH_HOME/hive/lib -i $OXH_HOME/hive/init.sql
The init.sql file contains the CREATE TEMPORARY FUNCTION statements that declare the XML functions.
Create a simple table:
hive> CREATE TABLE src(dummy STRING);
The SRC table is needed only to fulfill a SELECT syntax requirement. It is like the DUAL table in Oracle Database, which is referenced in SELECT statements to test SQL functions.
Load data from src.txt into the table:
hive> LOAD DATA LOCAL INPATH 'src.txt' OVERWRITE INTO TABLE src;
Query the table using Hive SELECT statements:
hive> SELECT * FROM src;
OK
XXX
Call an Oracle XQuery for Hadoop function for Hive. This example calls the xml_query function to parse an XML string:
hive> SELECT xml_query("x/y", "<x><y>123</y><z>456</z></x>") FROM src;
.
.
.
["123"]
If the extensions are accessible, then the query returns ["123"], as shown in the example.
The Oracle XQuery for Hadoop extensions enable you to query XML strings in Hive tables and XML file resources in the Hadoop distributed cache. These are the functions:
xml_query: Returns the result of a query as an array of STRING values.
xml_query_as_primitive: Returns the result of a query as a Hive primitive value. Each Hive primitive data type has a separate function named for it.
xml_exists: Tests whether the result of a query is empty.
xml_table: Maps an XML value to zero or more table rows, and enables nested repeating elements in XML to be mapped to Hive table rows.
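As a quick illustration (a sketch reusing the SRC helper table from the verification procedure; exact output formatting may differ by release), the primitive and existence variants can be called like this:

```sql
-- Returns the y value as a Hive INT rather than an array of strings
SELECT xml_query_as_int("x/y", "<x><y>123</y><z>456</z></x>") FROM src;

-- Tests whether the query result is empty; here the path matches, so the result is true
SELECT xml_exists("x/z", "<x><y>123</y><z>456</z></x>") FROM src;
```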
This section describes how you can use the Hive CREATE TABLE statement to create tables over large XML documents.
Hive queries over XML tables scale well, because Oracle XQuery for Hadoop splits up the XML so that the MapReduce framework can process it in parallel.
To support scalable processing and operate in the MapReduce framework, the table adapter scans for elements to use to create table rows. It parses only the elements that it identifies as being part of the table; the rest of the XML is ignored. Thus, the XML table adapter does not perform a true parse of the entire XML document, which imposes limitations on the input XML. Because of these limitations, you can create tables only over XML documents that meet the constraints listed in "XQuery Transformation Requirements." Otherwise, you might get errors or incorrect results.
The following is the basic syntax of the Hive CREATE TABLE statement for creating a Hive table over XML files:
CREATE TABLE table_name (columns)
ROW FORMAT
   SERDE 'oracle.hadoop.xquery.hive.OXMLSerDe'
STORED AS
   INPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLInputFormat'
   OUTPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLOutputFormat'
TBLPROPERTIES(configuration)
Parameters
columns: All column types in an XML table must be one of the Hive primitive types given in "Data Type Conversions."

configuration: Any of the properties described in "CREATE TABLE Configuration Properties." Separate multiple properties with commas.
Note:
Inserting data into XML tables is not supported.

Use these configuration properties in the configuration parameter of the CREATE TABLE command.
oxh-elements: Identifies the names of elements in the XML that map to rows in the table, in a comma-delimited list. Required; this property must be specified exactly one time.
This example maps each element named foo in the XML to a single row in the Hive table:
"oxh-elements" = "foo"
The next example maps each element named either foo or bar in the XML to a row in the Hive table:
"oxh-elements" = "foo, bar"
oxh-column.name: Specifies how an element selected by the oxh-elements property is mapped to columns in a row. In this property name, replace name with the name of a column in the table. The value can be any XQuery expression. The initial context item of the expression (the "." variable) is bound to the selected element.
If a column expression returns no value or raises a dynamic error, the column value is NULL. The first time an error occurs, it is logged and query processing continues; subsequent errors raised by the same column expression are not logged. For this reason, check the log files even when a query executes successfully.
Any column of the table that does not have a corresponding oxh-column property behaves as if the following property is specified:
"oxh-column.name" = "(./name | ./@name)[1]"
Thus, the default behavior is to select the first child element or attribute that matches the table column name. See "Syntax Example."
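For illustration, given a hypothetical input document, the default expression for a column named BAR picks up either of these forms:

```xml
<foo><bar>7</bar></foo>  <!-- BAR populated from the child element -->
<foo bar="7"/>           <!-- BAR populated from the attribute -->
```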
oxh-namespace.prefix: Defines a namespace binding. In this property name, replace prefix with the namespace prefix to bind.
This example binds the prefix myns to the namespace http://example.org:
"oxh-namespace.myns" = "http://example.org"
You can use this property multiple times to define additional namespaces. The namespace definitions are used when parsing the XML. The oxh-element and oxh-column property values can also reference them.
In the following example, only foo elements in the http://example.org namespace are mapped to table rows:
"oxh-namespace.myns" = "http://example.org", "oxh-elements" = "myns:foo", "oxh-column.bar" = "./myns:bar"
oxh-entity.name: Defines a set of entity reference definitions. In this property name, replace name with the name of the entity; the property value is its expansion text.
In the following example, entity references in the XML are expanded from &foo; to "foo value" and from &bar; to "bar value".
"oxh-entity.foo" = "foo value" "oxh-entity.bar" = "bar value"
oxh-charset: Specifies the character encoding of the XML files. The supported encodings are UTF-8 (default), ISO-8859-1, and US-ASCII.
All XML files for the table must share the same character encoding. Any encoding declarations in the XML files are ignored.
This example defines the character set:
"oxh-charset" = "ISO-8859-1"
This section includes the following examples:
This example shows how to map XML elements to column names.
In the following table definition, the oxh-elements property specifies that each element named foo in the XML is mapped to a single row in the table. The oxh-column properties specify that a Hive table column named BAR gets the value of the child element named bar converted to STRING, and the column named ZIP gets the value of the child element named zip converted to INT.
CREATE TABLE example (bar STRING, zip INT)
ROW FORMAT
   SERDE 'oracle.hadoop.xquery.hive.OXMLSerDe'
STORED AS
   INPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLInputFormat'
   OUTPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLOutputFormat'
TBLPROPERTIES(
   "oxh-elements" = "foo",
   "oxh-column.bar" = "./bar",
   "oxh-column.zip" = "./zip"
)
In this modified definition of the ZIP column, the column receives a value of -1 if the foo element does not have a child zip element, or if the zip element contains a nonnumeric value:
"oxh-column.zip" = "
if (./zip castable as xs:int) then
xs:int(./zip)
else
-1
"
The following two table definitions are equivalent. Table Definition 2 relies on the default mappings for the BAR and ZIP columns.
Table Definition 1:

CREATE TABLE example (bar STRING, zip INT)
ROW FORMAT
   SERDE 'oracle.hadoop.xquery.hive.OXMLSerDe'
STORED AS
   INPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLInputFormat'
   OUTPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLOutputFormat'
TBLPROPERTIES(
   "oxh-elements" = "foo",
   "oxh-column.bar" = "(./bar | ./@bar)[1]",
   "oxh-column.zip" = "(./zip | ./@zip)[1]"
)
Table Definition 2:

CREATE TABLE example (bar STRING, zip INT)
ROW FORMAT
   SERDE 'oracle.hadoop.xquery.hive.OXMLSerDe'
STORED AS
   INPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLInputFormat'
   OUTPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLOutputFormat'
TBLPROPERTIES(
   "oxh-elements" = "foo"
)
These examples show how to create Hive tables over a small XML document that contains comments posted by users of a fictitious website. Each comment element in the document has one or more like elements that indicate that the user liked the comment.
<comments>
<comment id="12345" user="john" text="It is raining :( "/>
<comment id="56789" user="kelly" text="I won the lottery!">
<like user="john"/>
<like user="mike"/>
</comment>
<comment id="54321" user="mike" text="Happy New Year!">
<like user="laura"/>
</comment>
</comments>
In the CREATE TABLE examples, the comments.xml input file is in the current working directory of the local file system.
The following Hive CREATE TABLE command creates a table named COMMENTS with a row for each comment containing the user names, text, and number of likes:
hive>
CREATE TABLE comments (usr STRING, content STRING, likeCt INT)
ROW FORMAT
   SERDE 'oracle.hadoop.xquery.hive.OXMLSerDe'
STORED AS
   INPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLInputFormat'
   OUTPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLOutputFormat'
TBLPROPERTIES(
   "oxh-elements" = "comment",
   "oxh-column.usr" = "./@user",
   "oxh-column.content" = "./@text",
   "oxh-column.likeCt" = "fn:count(./like)"
);
The Hive LOAD DATA command loads comments.xml into the COMMENTS table. See "Simple Examples" for the contents of the file.
hive> LOAD DATA LOCAL INPATH 'comments.xml' OVERWRITE INTO TABLE comments;
The following query shows the content of the COMMENTS table.
hive> SELECT usr, content, likeCt FROM comments;
.
.
.
john It is raining :( 0
kelly I won the lottery! 2
mike Happy New Year! 1
This CREATE TABLE command is like Example 1, except that the like elements are produced as XML in a STRING column.
hive>
CREATE TABLE comments2 (usr STRING, content STRING, likes STRING)
ROW FORMAT
SERDE 'oracle.hadoop.xquery.hive.OXMLSerDe'
STORED AS
INPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLInputFormat'
OUTPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLOutputFormat'
TBLPROPERTIES(
"oxh-elements" = "comment",
"oxh-column.usr" = "./@user",
"oxh-column.content" = "./@text",
"oxh-column.likes" = "fn:serialize(<likes>{./like}</likes>)"
);
The Hive LOAD DATA command loads comments.xml into the table. See "Simple Examples" for the contents of the file.
hive> LOAD DATA LOCAL INPATH 'comments.xml' OVERWRITE INTO TABLE comments2;
The following query shows the content of the COMMENTS2 table.
hive> SELECT usr, content, likes FROM comments2;
.
.
.
john It is raining :( <likes/>
kelly I won the lottery! <likes><like user="john"/><like user="mike"/></likes>
mike Happy New Year! <likes><like user="laura"/></likes>
The next query extracts the user names from the like elements:
hive> SELECT usr, t.user FROM comments2 LATERAL VIEW
    > xml_table("likes/like", comments2.likes, struct("./@user")) t AS user;
.
.
.
kelly john
kelly mike
mike laura
This command creates a table named COMMENTS3 with a row for each comment, and produces the XML in a single STRING column.
hive>
CREATE TABLE comments3 (xml STRING)
ROW FORMAT
   SERDE 'oracle.hadoop.xquery.hive.OXMLSerDe'
STORED AS
   INPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLInputFormat'
   OUTPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLOutputFormat'
TBLPROPERTIES(
   "oxh-elements" = "comment",
   "oxh-column.xml" = "fn:serialize(.)"
);
The Hive LOAD DATA command loads comments.xml into the table. See "Simple Examples" for the contents of the file.
hive> LOAD DATA LOCAL INPATH 'comments.xml' OVERWRITE INTO TABLE comments3;
The following query shows the contents of the XML column:
hive> SELECT xml FROM comments3;
.
.
.
<comment id="12345" user="john" text="It is raining :( "/>
<comment id="56789" user="kelly" text="I won the lottery!">
<like user="john"/>
<like user="mike"/>
</comment>
<comment id="54321" user="mike" text="Happy New Year!">
<like user="laura"/>
</comment>
The next query extracts the IDs and converts them to integers:
hive> SELECT xml_query_as_int("comment/@id", xml) FROM comments3;
.
.
.
12345
56789
54321
These examples use data from OpenStreetMap, which provides free map data for the entire world. You can export the data as XML for specific geographic regions or for the entire planet. An OpenStreetMap XML document mainly contains a sequence of node, way, and relation elements.
In these examples, the OpenStreetMap XML files are stored in the /user/name/osm HDFS directory.
See Also:
To download OpenStreetMap data:
For information about the OpenStreetMap XML format:
This example creates a table over OpenStreetMap XML with one row for each node element as follows:
The id, lat, lon, and user attributes of the node element are mapped to table columns.
The year is extracted from the timestamp attribute and mapped to the YEAR column. If a node does not have a timestamp attribute, then -1 is used for the year.
If the node element has any child tag elements, then they are stored as an XML string in the TAGS column. If the node has no child tag elements, then the column value is NULL.
hive>
CREATE EXTERNAL TABLE nodes (
id BIGINT,
latitude DOUBLE,
longitude DOUBLE,
year SMALLINT,
tags STRING
)
ROW FORMAT
SERDE 'oracle.hadoop.xquery.hive.OXMLSerDe'
STORED AS
INPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLInputFormat'
OUTPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLOutputFormat'
LOCATION '/user/name/osm'
TBLPROPERTIES (
"oxh-elements" = "node",
"oxh-column.id" = "./@id",
"oxh-column.latitude" = "./@lat",
"oxh-column.longitude" = "./@lon",
"oxh-column.year" = "
if (fn:exists(./@timestamp)) then
fn:year-from-dateTime(xs:dateTime(./@timestamp))
else
-1
",
"oxh-column.tags" = "
if (fn:exists(./tag)) then
fn:serialize(<tags>{./tag}</tags>)
else
()
"
);
The following query returns the number of nodes per year:
hive> SELECT year, count(*) FROM nodes GROUP BY year;
This query returns the total number of tags across nodes:
hive> SELECT sum(xml_query_as_int("count(tags/tag)", tags)) FROM nodes;
In OpenStreetMap XML, the node, way, and relation elements share a set of common attributes, such as the user who contributed the data. The next table definition produces one row for each node, way, and relation element.
See Also:
For a description of the OpenStreetMap elements and attributes:
hive>
CREATE EXTERNAL TABLE osm (
id BIGINT,
uid BIGINT,
type STRING
)
ROW FORMAT
SERDE 'oracle.hadoop.xquery.hive.OXMLSerDe'
STORED AS
INPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLInputFormat'
OUTPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLOutputFormat'
LOCATION '/user/name/osm'
TBLPROPERTIES (
"oxh-elements" = "node, way, relation",
"oxh-column.id" = "./@id",
"oxh-column.uid" = "./@uid",
"oxh-column.type" = "./name()"
);
The following query returns the number of node, way, and relation elements. The TYPE column is set to the name of the selected element, which is either node, way, or relation.
hive> SELECT type, count(*) FROM osm GROUP BY type;
This query returns the number of distinct user IDs:
hive> SELECT count(*) FROM (SELECT uid FROM osm GROUP BY uid) t;
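The subquery above avoids COUNT(DISTINCT); on Hive versions that support it, the same result can likely be obtained directly (a sketch, not verified against a specific Hive release):

```sql
hive> SELECT count(DISTINCT uid) FROM osm;
```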