7 Oracle XML Extensions for Hive

This chapter explains how to use the XML extensions for Apache Hive provided on Oracle Big Data Appliance with Oracle XQuery for Hadoop. The chapter contains the following sections:

Note:

The features described in this chapter are available only on Oracle Big Data Appliance. You cannot use them on other Hadoop clusters.

7.1 What are the XML Extensions for Apache Hive?

Oracle XQuery for Hadoop provides XML processing support that enables you to do the following:

  • Query large XML files in HDFS as Hive tables

  • Query XML strings in Hive tables

  • Query XML file resources in the Hadoop distributed cache

  • Efficiently extract atomic values from XML without using expensive DOM parsing

  • Retrieve, generate, and transform complex XML elements

  • Generate multiple table rows from a single XML value

  • Manage missing and dirty data in XML

The XML extensions also support these W3C modern standards:

  • XQuery 1.0

  • XQuery Update Facility 1.0 (transform expressions)

  • XPath 2.0

  • XML Schema 1.0

  • XML Namespaces

The XML extensions have two components:

7.2 Using the Hive Extensions

To enable the Oracle XQuery for Hadoop extensions, use the --auxpath and -i arguments when starting Hive:

$ hive --auxpath $OXH_HOME/hive/lib -i $OXH_HOME/hive/init.sql

The first time you use the extensions, you might want to verify that they are accessible. The following procedure creates a table named SRC, loads one row into it, and then calls the xml_query function.

The SRC table is needed only to fulfill a SELECT syntax requirement. It is like the DUAL table in Oracle Database, which is referenced in SELECT statements to test SQL functions.

To verify that the extensions are accessible: 

  1. Log in to an Oracle Big Data Appliance server where you plan to work.

  2. Create a text file named src.txt that contains one line:

    $ echo "XXX" > src.txt
    
  3. Start the Hive command-line interface (CLI):

    $ hive --auxpath $OXH_HOME/hive/lib -i $OXH_HOME/hive/init.sql
    

    The init.sql file contains the CREATE TEMPORARY FUNCTION statements that declare the XML functions.

  4. Create a simple table and load data from src.txt:

    hive> CREATE TABLE src(dummy STRING);
    hive> LOAD DATA LOCAL INPATH 'src.txt' OVERWRITE INTO TABLE src;
    
  5. Call an Oracle XQuery for Hadoop function for Hive:

    hive> SELECT xml_query("x/y", "<x><y>123</y><z>456</z></x>") FROM src;
    

If the extensions are accessible, then the query returns ["123"].

7.3 Creating XML Tables

This section describes how you can use the Hive CREATE TABLE statement to create tables over large XML documents.

Hive queries over XML tables scale well, because Oracle XQuery for Hadoop splits up the XML so that the MapReduce framework can process it in parallel.

To support scalable processing and operate in the MapReduce framework, the table adapter scans for elements to use to create table rows. It parses only the elements that it identifies as being part of the table; the rest of the XML is ignored. Thus, the XML table adapter does not perform a true parse of the entire XML document, which imposes limitations on the input XML. Because of these limitations, you can create tables only over XML documents that meet the constraints listed in "XQuery Transformation Requirements." Otherwise, you may get errors or incorrect results.

7.3.1 Hive CREATE TABLE Syntax for XML Tables

The following is the basic syntax to use when creating a Hive table over XML files:

CREATE TABLE table_name (columns)
ROW FORMAT
   SERDE 'oracle.hadoop.xquery.hive.OXMLSerDe'
STORED AS
   INPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLInputFormat'
   OUTPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLOutputFormat'
TBLPROPERTIES(configuration)

columns: All column types in an XML table must be one of the Hive primitive types given in "About Data Type Conversions."

configuration: Any of the properties described in "CREATE TABLE Configuration Properties."

Note:

Inserting into XML tables is not supported.

CREATE TABLE Configuration Properties 

oxh-elements

Identifies the names of elements in the XML that map to rows in the table, in a comma-delimited list. This property must be specified one time. Required.

This example maps each element named foo in the XML to a single row in the Hive table:

"oxh-elements" = "foo"

The next example maps each element named either foo or bar in the XML to a row in the Hive table:

"oxh-elements" = "foo, bar"
oxh-column.name

Specifies how an element selected by the oxh-elements property is mapped to columns in a row. In this property name, replace name with the name of a column in the table. The value can be any XQuery expression. The initial context item of the expression (the "." variable) is bound to the selected element.

Be sure to check log files even when a query executes successfully. If a column expression returns no value or raises a dynamic error, the column value is NULL. The first time an error occurs, it is logged and query processing continues. Subsequent errors raised by the same column expression are not logged.

In the following example, the oxh-elements property specifies that each element named foo in the XML is mapped to a single row in the table. The oxh-column properties specify that the Hive table column named BAR gets the value of the child element named bar converted to STRING, and the column named ZIP gets the value of the child element named zip converted to INT.

CREATE TABLE example (bar STRING, zip INT)
ROW FORMAT
   SERDE 'oracle.hadoop.xquery.hive.OXMLSerDe'
STORED AS
   INPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLInputFormat'
   OUTPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLOutputFormat'
TBLPROPERTIES(
   "oxh-elements" = "foo", 
   "oxh-column.bar" = "./bat",
   "oxh-column.zip" = "./zip"
)

In this modified definition of the zip column, the column receives a value of -1 if the foo element does not have a child zip element, or the zip element contains a nonnumeric value.

"oxh-column.zip" = "
   if (./zip castable as xs:int) then 
      xs:int(./zip) 
   else 
      -1
"

Any column of the table that does not have a corresponding oxh-column property behaves as if the following property is specified:

"oxh-column.name" = "(./name | ./@name)[1]"

Thus, the default behavior is to select the first child element or attribute that matches the table column name.

The following two table definitions are equivalent:

Table Definition 1:

CREATE TABLE example (bar STRING, zip INT)
ROW FORMAT
   SERDE 'oracle.hadoop.xquery.hive.OXMLSerDe'
STORED AS
   INPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLInputFormat'
   OUTPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLOutputFormat'
TBLPROPERTIES(
   "oxh-elements" = "foo", 
   "oxh-column.bar" = "(./bar | ./@bar)[1]",
   "oxh-column.zip" = "(./zip | ./@zip)[1]"
)

Table Definition 2:

CREATE TABLE example (bar STRING, zip INT)
ROW FORMAT
   SERDE 'oracle.hadoop.xquery.hive.OXMLSerDe'
STORED AS
   INPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLInputFormat'
   OUTPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLOutputFormat'
TBLPROPERTIES(
   "oxh-elements" = "foo"
)
oxh-namespace.prefix

Defines a namespace binding.

This example binds the prefix myns to the namespace http://example.org:

"oxh-namespace.myns" = "http://example.org"

You can use this property multiple times to define additional namespaces. The namespace definitions are used when parsing the XML. The oxh-element and oxh-column property values can also reference them.

In the following examples, only foo elements in the http://example.org namespace are mapped to table rows:

"oxh-namespace.myns" = "http://example.org",
"oxh-elements" = "myns:foo",
"oxh-column.bar" = "./myns:bar"
oxh-entity.name

Defines a set of entity reference definitions.

In the following example, entity references in the XML are expanded from &foo; to "foo value" and from &bar; to "bar value".

"oxh-entity.foo" = "foo value"
"oxh-entity.bar" = "bar value"
oxh-charset

Specifies the character encoding of the XML files. The supported encodings are UTF-8 (default), ISO-8859-1, and US-ASCII.

This example sets the character set:

"oxh-charset" = "ISO-8859-1"

7.3.2 CREATE TABLE Examples

This section includes some examples using a simple data set, and other examples using a very large and complex real-world data set:

7.3.2.1 Simple Examples

These examples show how to create Hive tables over a small XML document that contains comments posted by users of a fictitious website. Each comment element in the document has one or more like elements that indicate that the user liked the comment.

<comments>
   <comment id="12345" user="john" text="It is raining :( "/>
   <comment id="56789" user="kelly" text="I won the lottery!">
      <like user="john"/>
      <like user="mike"/>
   </comment>
   <comment id="54321" user="mike" text="Happy New Year!">
      <like user="laura"/>
   </comment>
</comments>

In the CREATE TABLE examples, the comments.xml input file is in the current working directory of the local file system.

Example 1   Creating a Table

This example creates a table named COMMENTS with a row for each comment, with the user, number of likes, and text.

hive> 
  CREATE TABLE comments (usr STRING, content STRING, likeCt INT)
  ROW FORMAT
    SERDE 'oracle.hadoop.xquery.hive.OXMLSerDe'
  STORED AS
    INPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLInputFormat'
    OUTPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLOutputFormat'
  TBLPROPERTIES(
    "oxh-elements" = "comment",
    "oxh-column.usr" = "./@user",
    "oxh-column.content" = "./@text",
    "oxh-column.likeCt" = "fn:count(./like)"
  );

The Hive LOAD DATA command loads comments.xml into the COMMENTS table. See "Simple Examples" for the contents of the file.

hive> LOAD DATA LOCAL INPATH 'comments.xml' OVERWRITE INTO TABLE comments;

The following query shows the content of the COMMENTS table.

hive> SELECT usr, content, likeCt FROM comments;
 
john  It is raining :(     0
kelly I won the lottery!   2
mike  Happy New Year!      1
Example 2   Querying an XML Column

This example is like Example 1, except that the like elements are produced as XML in a STRING column.

hive>
  CREATE TABLE comments2 (usr STRING, content STRING, likes STRING)
  ROW FORMAT
    SERDE 'oracle.hadoop.xquery.hive.OXMLSerDe'
  STORED AS
    INPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLInputFormat'
    OUTPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLOutputFormat'
  TBLPROPERTIES(
    "oxh-elements" = "comment",
    "oxh-column.usr" = "./@user",
    "oxh-column.content" = "./@text",
    "oxh-column.likes" = "fn:serialize(<likes>{./like}</likes>)"
  );
 

The Hive LOAD DATA command loads comments.xml into the table. See "Simple Examples" for the contents of the file.

hive> LOAD DATA LOCAL INPATH 'comments.xml' OVERWRITE INTO TABLE comments2;
 

The following query shows the content of the COMMENTS2 table.

hive> SELECT usr, content, likes FROM comments2;
 
john   It is raining :(    <likes/>
kelly  I won the lottery!  <likes><like user="john"/><like user="mike"/></likes>
mike   Happy New Year!     <likes><like user="laura"/></likes>
 

The next query extracts the user names from the like elements:

hive>
SELECT usr, t.user
   FROM comments2
   LATERAL VIEW 
      xml_table("likes/like", comments2.likes, struct("./@user")) t AS user;
 
kelly  john
kelly  mike
mike   laura
Example 3   Generating XML in a Single String Column

This example creates a table named COMMENTS3 with a row for each comment, and produces the XML in a single STRING column.

hive>
CREATE TABLE comments3 (xml STRING)
ROW FORMAT
   SERDE 'oracle.hadoop.xquery.hive.OXMLSerDe'
STORED AS
   INPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLInputFormat'
   OUTPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLOutputFormat'
TBLPROPERTIES(
   "oxh-elements" = "comment",
   "oxh-column.xml" = "fn:serialize(.)"
   ); 

The Hive LOAD DATA command loads comments.xml into the table. See "Simple Examples" for the contents of the file.

hive> LOAD DATA LOCAL INPATH 'comments.xml' OVERWRITE INTO TABLE comments3;

The following query shows the contents of the XML column:

hive> SELECT xml FROM comments3;
 
  <comment id="12345" user="john" text="It is raining :( "/>
  <comment id="56789" user="kelly" text="I won the lottery!">
     <like user="john"/>
     <like user="mike"/>
  </comment>
  <comment id="54321" user="mike" text="Happy New Year!">
     <like user="laura"/>
  </comment>
 

The next query extracts the IDs and converts them to integers:

hive> SELECT xml_query_as_int("comment/@id", xml) FROM comments3;
 
12345
56789
54321

7.3.2.2 Detailed Examples

These examples use data from OpenStreetMap, which provides free map data for the entire world. You can export the data as XML for specific geographic regions or the entire planet. An OpenStreetMap XML document mainly contains a sequence of node, way, and relation elements.

In these examples, the OpenStreetMap XML files are stored in the /user/name/osm HDFS directory.

See Also:

Example 1   Creating a Table Over OpenStreetMap XML

This example creates a table over OpenStreetMap XML with one row for each node element as follows:

  • The id, lat, lon, and user attributes of the node element are mapped to table columns.

  • The year is extracted from the timestamp attribute and mapped to the YEAR column. If a node does not have a timestamp attribute, then -1 is used for the year.

  • If the node element has any child tag elements, then they are stored as an XML string in the TAGS column. If node has no child tag elements, then column value is NULL.

hive>
  CREATE EXTERNAL TABLE nodes (
     id BIGINT,
     latitude DOUBLE,
     longitude DOUBLE,
     year SMALLINT,
     tags STRING
  ) 
  ROW FORMAT 
    SERDE 'oracle.hadoop.xquery.hive.OXMLSerDe'
  STORED AS
    INPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLInputFormat'
    OUTPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLOutputFormat'
  LOCATION '/user/name/osm'
  TBLPROPERTIES (
    "oxh-elements" = "node",
    "oxh-column.id" = "./@id",
    "oxh-column.latitude" = "./@lat",
    "oxh-column.longitude" = "./@lon",
    "oxh-column.year" = "
         if (fn:exists(./@timestamp)) then
            fn:year-from-dateTime(xs:dateTime(./@timestamp))
         else
            -1
      ",
    "oxh-column.tags" = "
         if (fn:exists(./tag)) then
           fn:serialize(<tags>{./tag}</tags>)
         else 
           ()
      "
  );
 

The following query returns the number of nodes per year:

hive>  SELECT year, count(*) FROM nodes GROUP BY year;

This query returns the total number of tags across nodes:

hive>
  SELECT sum(xml_query_as_int("count(tags/tag)", tags))
  FROM nodes;
 
Example 2   

In OpenStreetMap XML, the node, way, and relation elements share a set of common attributes such as the user who contributed the data. The next table produces one row for each node, way, and relation element.

See Also:

For a description of the OpenStreetMap elements and attributes:

http://wiki.openstreetmap.org/wiki/Elements

hive>
  CREATE EXTERNAL TABLE osm (
     id BIGINT,
     uid BIGINT,
     type STRING
  ) 
  ROW FORMAT 
    SERDE 'oracle.hadoop.xquery.hive.OXMLSerDe'
  STORED AS
    INPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLInputFormat'
    OUTPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLOutputFormat'
  LOCATION '/user/name/osm'
  TBLPROPERTIES (
    "oxh-elements" = "node, way, relation",
    "oxh-column.id" = "./@id",
    "oxh-column.uid" = "./@uid",
    "oxh-column.type" = "./name()"
  );

The following query returns the number of node, way, and relation elements. The TYPE column is set to the name of the selected element, which is either node, way, or relation.

hive>
  SELECT type, count(*) 
  FROM osm
  GROUP BY type;
 

This query returns the number of distinct user IDs:

hive>
  SELECT count(*) 
  FROM (  
    SELECT uid
    FROM osm
    GROUP BY uid
  ) t;