7 Oracle XML Extensions for Hive

This chapter explains how to use the XML extensions for Apache Hive provided on Oracle Big Data Appliance with Oracle XQuery for Hadoop. The chapter contains the following sections:

Note:

The features described in this chapter are available only on Oracle Big Data Appliance. They are not included in the downloadable Oracle XQuery for Hadoop installation package.

7.1 What are the XML Extensions for Hive?

The XML Extensions for Hive provide XML processing support that enables you to do the following:

  • Query large XML files in HDFS as Hive tables

  • Query XML strings in Hive tables

  • Query XML file resources in the Hadoop distributed cache

  • Efficiently extract atomic values from XML without using expensive DOM parsing

  • Retrieve, generate, and transform complex XML elements

  • Generate multiple table rows from a single XML value

  • Manage missing and dirty data in XML

The XML extensions also support these W3C modern standards:

  • XQuery 1.0

  • XQuery Update Facility 1.0 (transform expressions)

  • XPath 2.0

  • XML Schema 1.0

  • XML Namespaces

The XML extensions have two components:

7.2 Using the Hive Extensions

To enable the Oracle XQuery for Hadoop extensions, use the --auxpath and -i arguments when starting Hive:

$ hive --auxpath $OXH_HOME/hive/lib -i $OXH_HOME/hive/init.sql

The first time you use the extensions, verify that they are accessible. The following procedure creates a table named SRC, loads one row into it, and calls the xml_query function.

To verify that the extensions are accessible: 

  1. Log in to an Oracle Big Data Appliance server where you plan to work.

  2. Create a text file named src.txt that contains one line:

    $ echo "XXX" > src.txt
    
  3. Start the Hive command-line interface (CLI):

    $ hive --auxpath $OXH_HOME/hive/lib -i $OXH_HOME/hive/init.sql
    

    The init.sql file contains the CREATE TEMPORARY FUNCTION statements that declare the XML functions.

  4. Create a simple table:

    hive> CREATE TABLE src(dummy STRING);
    

    The SRC table is needed only to fulfill a SELECT syntax requirement. It is like the DUAL table in Oracle Database, which is referenced in SELECT statements to test SQL functions.

  5. Load data from src.txt into the table:

    hive> LOAD DATA LOCAL INPATH 'src.txt' OVERWRITE INTO TABLE src;
    
  6. Query the table using Hive SELECT statements:

    hive> SELECT * FROM src;
    OK
    xxx
    
  7. Call an Oracle XQuery for Hadoop function for Hive. This example calls the xml_query function to parse an XML string:

    hive> SELECT xml_query("x/y", "<x><y>123</y><z>456</z></x>") FROM src;
         .
         .
         .
    ["123"]
    

If the extensions are accessible, then the query returns ["123"], as shown in the example.

7.3 About the Hive Functions

The Oracle XQuery for Hadoop extensions enable you to query XML strings in Hive tables and XML file resources in the Hadoop distributed cache. These are the functions:

  • xml_query: Returns the result of a query as an array of STRING values.

  • xml_query_as_primitive: Returns the result of a query as a Hive primitive value. Each Hive primitive data type has a separate function named for it.

  • xml_exists: Tests if the result of a query is empty

  • xml_table: Maps an XML value to zero or more table rows, and enables nested repeating elements in XML to be mapped to Hive table rows.

See "Oracle XML Functions for Hive Reference."

7.4 Creating XML Tables

This section describes how you can use the Hive CREATE TABLE statement to create tables over large XML documents.

Hive queries over XML tables scale well, because Oracle XQuery for Hadoop splits up the XML so that the MapReduce framework can process it in parallel.

To support scalable processing and operate in the MapReduce framework, the table adapter scans for elements to use to create table rows. It parses only the elements that it identifies as being part of the table; the rest of the XML is ignored. Thus, the XML table adapter does not perform a true parse of the entire XML document, which imposes limitations on the input XML. Because of these limitations, you can create tables only over XML documents that meet the constraints listed in "XQuery Transformation Requirements." Otherwise, you might get errors or incorrect results.

7.4.1 Hive CREATE TABLE Syntax for XML Tables

The following is the basic syntax of the Hive CREATE TABLE statement for creating a Hive table over XML files:

CREATE TABLE table_name (columns)
ROW FORMAT
   SERDE 'oracle.hadoop.xquery.hive.OXMLSerDe'
STORED AS
   INPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLInputFormat'
   OUTPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLOutputFormat'
TBLPROPERTIES(configuration)

Parameters 

columns

All column types in an XML table must be one of the Hive primitive types given in "Data Type Conversions."

configuration

Any of the properties described in "CREATE TABLE Configuration Properties." Separate multiple properties with commas.

Note:

Inserting data into XML tables is not supported.

7.4.2 CREATE TABLE Configuration Properties

Use these configuration properties in the configuration parameter of the CREATE TABLE command.

oxh-elements

Identifies the names of elements in the XML that map to rows in the table, in a comma-delimited list. This property must be specified one time. Required.

This example maps each element named foo in the XML to a single row in the Hive table:

"oxh-elements" = "foo"

The next example maps each element named either foo or bar in the XML to a row in the Hive table:

"oxh-elements" = "foo, bar"
oxh-column.name

Specifies how an element selected by the oxh-elements property is mapped to columns in a row. In this property name, replace name with the name of a column in the table. The value can be any XQuery expression. The initial context item of the expression (the "." variable) is bound to the selected element.

Check the log files even when a query executes successfully. If a column expression returns no value or raises a dynamic error, the column value is NULL. The first time an error occurs, it is logged and query processing continues. Subsequent errors raised by the same column expression are not logged.

Any column of the table that does not have a corresponding oxh-column property behaves as if the following property is specified:

"oxh-column.name" = "(./name | ./@name)[1]"

Thus, the default behavior is to select the first child element or attribute that matches the table column name. See "Syntax Example."

oxh-namespace.prefix

Defines a namespace binding.

This example binds the prefix myns to the namespace http://example.org:

"oxh-namespace.myns" = "http://example.org"

You can use this property multiple times to define additional namespaces. The namespace definitions are used when parsing the XML. The oxh-element and oxh-column property values can also reference them.

In the following example, only foo elements in the http://example.org namespace are mapped to table rows:

"oxh-namespace.myns" = "http://example.org",
"oxh-elements" = "myns:foo",
"oxh-column.bar" = "./myns:bar"
oxh-entity.name

Defines a set of entity reference definitions.

In the following example, entity references in the XML are expanded from &foo; to "foo value" and from &bar; to "bar value".

"oxh-entity.foo" = "foo value"
"oxh-entity.bar" = "bar value"
oxh-charset

Specifies the character encoding of the XML files. The supported encodings are UTF-8 (default), ISO-8859-1, and US-ASCII.

All XML files for the table must share the same character encoding. Any encoding declarations in the XML files are ignored.

This example defines the character set:

"oxh-charset" = "ISO-8859-1"

7.4.3 CREATE TABLE Examples

This section includes the following examples:

7.4.3.1 Syntax Example

This example shows how to map XML elements to column names.

Example 1   Basic Column Mappings

In the following table definition, the oxh-elements property specifies that each element named foo in the XML is mapped to a single row in the table. The oxh-column properties specify that a Hive table column named BAR gets the value of the child element named bar converted to STRING, and the column named ZIP gets the value of the child element named zip converted to INT.

CREATE TABLE example (bar STRING, zip INT)
ROW FORMAT
   SERDE 'oracle.hadoop.xquery.hive.OXMLSerDe'
STORED AS
   INPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLInputFormat'
   OUTPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLOutputFormat'
TBLPROPERTIES(
   "oxh-elements" = "foo", 
   "oxh-column.bar" = "./bat",
   "oxh-column.zip" = "./zip"
)
Example 2   Conditional Column Mappings

In this modified definition of the ZIP column, the column receives a value of -1 if the foo element does not have a child zip element, or if the zip element contains a nonnumeric value:

"oxh-column.zip" = "
   if (./zip castable as xs:int) then 
      xs:int(./zip) 
   else 
      -1
"
Example 3   Default Column Mappings

The following two table definitions are equivalent. Table Definition 2 relies on the default mappings for the BAR and ZIP columns.

Table Definition 1

CREATE TABLE example (bar STRING, zip INT)
ROW FORMAT
   SERDE 'oracle.hadoop.xquery.hive.OXMLSerDe'
STORED AS
   INPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLInputFormat'
   OUTPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLOutputFormat'
TBLPROPERTIES(
   "oxh-elements" = "foo", 
   "oxh-column.bar" = "(./bar | ./@bar)[1]",
   "oxh-column.zip" = "(./zip | ./@zip)[1]"
)

Table Definition 2

CREATE TABLE example (bar STRING, zip INT)
ROW FORMAT
   SERDE 'oracle.hadoop.xquery.hive.OXMLSerDe'
STORED AS
   INPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLInputFormat'
   OUTPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLOutputFormat'
TBLPROPERTIES(
   "oxh-elements" = "foo"
)

7.4.3.2 Simple Examples

These examples show how to create Hive tables over a small XML document that contains comments posted by users of a fictitious website. Each comment element in the document has one or more like elements that indicate that the user liked the comment.

<comments>
   <comment id="12345" user="john" text="It is raining :( "/>
   <comment id="56789" user="kelly" text="I won the lottery!">
      <like user="john"/>
      <like user="mike"/>
   </comment>
   <comment id="54321" user="mike" text="Happy New Year!">
      <like user="laura"/>
   </comment>
</comments>

In the CREATE TABLE examples, the comments.xml input file is in the current working directory of the local file system.

Example 1   Creating a Table

The following Hive CREATE TABLE command creates a table named COMMENTS with a row for each comment containing the user names, text, and number of likes:

hive>
CREATE TABLE comments (usr STRING, content STRING, likeCt INT)
ROW FORMAT
   SERDE 'oracle.hadoop.xquery.hive.OXMLSerDe'
STORED AS
   INPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLInputFormat'
   OUTPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLOutputFormat'
TBLPROPERTIES(
   "oxh-elements" = "comment",
   "oxh-column.usr" = "./@user",
   "oxh-column.content" = "./@text",
   "oxh-column.likeCt" = "fn:count(./like)"
);

The Hive LOAD DATA command loads comments.xml into the COMMENTS table. See "Simple Examples" for the contents of the file.

hive> LOAD DATA LOCAL INPATH 'comments.xml' OVERWRITE INTO TABLE comments;
]

The following query shows the content of the COMMENTS table.

hive> SELECT usr, content, likeCt FROM comments;
     .
     .
     .
john  It is raining :(     0
kelly I won the lottery!   2
mike  Happy New Year!      1
Example 2   Querying an XML Column

This CREATE TABLE command is like Example 1, except that the like elements are produced as XML in a STRING column.

hive>
CREATE TABLE comments2 (usr STRING, content STRING, likes STRING)
ROW FORMAT
   SERDE 'oracle.hadoop.xquery.hive.OXMLSerDe'
STORED AS
   INPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLInputFormat'
   OUTPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLOutputFormat'
TBLPROPERTIES(
   "oxh-elements" = "comment",
   "oxh-column.usr" = "./@user",
   "oxh-column.content" = "./@text",
   "oxh-column.likes" = "fn:serialize(<likes>{./like}</likes>)"
);

The Hive LOAD DATA command loads comments.xml into the table. See "Simple Examples" for the contents of the file.

hive> LOAD DATA LOCAL INPATH 'comments.xml' OVERWRITE INTO TABLE comments2;
 

The following query shows the content of the COMMENTS2 table.

hive> SELECT usr, content, likes FROM comments2;
     .
     .
     .
john   It is raining :(    <likes/>
kelly  I won the lottery!  <likes><like user="john"/><like user="mike"/></likes>
mike   Happy New Year!     <likes><like user="laura"/></likes>
 

The next query extracts the user names from the like elements:

hive> SELECT usr, t.user FROM comments2 LATERAL VIEW 
    > xml_table("likes/like", comments2.likes, struct("./@user")) t AS user;
     .
     .
     .
kelly  john
kelly  mike
mike   laura
Example 3   Generating XML in a Single String Column

This command creates a table named COMMENTS3 with a row for each comment, and produces the XML in a single STRING column.

hive> 
CREATE TABLE comments3 (xml STRING)
ROW FORMAT
   SERDE 'oracle.hadoop.xquery.hive.OXMLSerDe'
STORED AS
   INPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLInputFormat'
   OUTPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLOutputFormat'
TBLPROPERTIES(
   "oxh-elements" = "comment",
   "oxh-column.xml" = "fn:serialize(.)"
   ); 

The Hive LOAD DATA command loads comments.xml into the table. See "Simple Examples" for the contents of the file.

hive> LOAD DATA LOCAL INPATH 'comments.xml' OVERWRITE INTO TABLE comments3;

The following query shows the contents of the XML column:

hive> SELECT xml FROM comments3;
     .
     .
     .
<comment id="12345" user="john" text="It is raining :( "/>
<comment id="56789" user="kelly" text="I won the lottery!">
   <like user="john"/>
   <like user="mike"/>
</comment>
<comment id="54321" user="mike" text="Happy New Year!">
   <like user="laura"/>
</comment>
 

The next query extracts the IDs and converts them to integers:

hive> SELECT xml_query_as_int("comment/@id", xml) FROM comments3;
     .
     .
     .
12345
56789
54321

7.4.3.3 OpenStreetMap Examples

These examples use data from OpenStreetMap, which provides free map data for the entire world. You can export the data as XML for specific geographic regions or the entire planet. An OpenStreetMap XML document mainly contains a sequence of node, way, and relation elements.

In these examples, the OpenStreetMap XML files are stored in the /user/name/osm HDFS directory.

See Also:

Example 1   Creating a Table Over OpenStreetMap XML

This example creates a table over OpenStreetMap XML with one row for each node element as follows:

  • The id, lat, lon, and user attributes of the node element are mapped to table columns.

  • The year is extracted from the timestamp attribute and mapped to the YEAR column. If a node does not have a timestamp attribute, then -1 is used for the year.

  • If the node element has any child tag elements, then they are stored as an XML string in the TAGS column. If node has no child tag elements, then column value is NULL.

hive>
CREATE EXTERNAL TABLE nodes (
   id BIGINT,
   latitude DOUBLE,
   longitude DOUBLE,
   year SMALLINT,
   tags STRING
) 
ROW FORMAT 
   SERDE 'oracle.hadoop.xquery.hive.OXMLSerDe'
STORED AS
   INPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLInputFormat'
   OUTPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLOutputFormat'
LOCATION '/user/name/osm'
TBLPROPERTIES (
   "oxh-elements" = "node",
   "oxh-column.id" = "./@id",
   "oxh-column.latitude" = "./@lat",
   "oxh-column.longitude" = "./@lon",
   "oxh-column.year" = "
      if (fn:exists(./@timestamp)) then
         fn:year-from-dateTime(xs:dateTime(./@timestamp))
      else
         -1
   ",
   "oxh-column.tags" = "
      if (fn:exists(./tag)) then
         fn:serialize(<tags>{./tag}</tags>)
      else 
         ()
   "
);

The following query returns the number of nodes per year:

hive> SELECT year, count(*) FROM nodes GROUP BY year;

This query returns the total number of tags across nodes:

hive> SELECT sum(xml_query_as_int("count(tags/tag)", tags)) FROM nodes;
Example 2   

In OpenStreetMap XML, the node, way, and relation elements share a set of common attributes, such as the user who contributed the data. The next table produces one row for each node, way, and relation element.

See Also:

For a description of the OpenStreetMap elements and attributes:

http://wiki.openstreetmap.org/wiki/Elements

hive>
  CREATE EXTERNAL TABLE osm (
     id BIGINT,
     uid BIGINT,
     type STRING
  ) 
  ROW FORMAT 
    SERDE 'oracle.hadoop.xquery.hive.OXMLSerDe'
  STORED AS
    INPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLInputFormat'
    OUTPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLOutputFormat'
  LOCATION '/user/name/osm'
  TBLPROPERTIES (
    "oxh-elements" = "node, way, relation",
    "oxh-column.id" = "./@id",
    "oxh-column.uid" = "./@uid",
    "oxh-column.type" = "./name()"
  );

The following query returns the number of node, way, and relation elements. The TYPE column is set to the name of the selected element, which is either node, way, or relation.

hive> SELECT type, count(*) FROM osm GROUP BY type;
 

This query returns the number of distinct user IDs:

hive> SELECT count(*) FROM (SELECT uid FROM osm GROUP BY uid) t;