9 Oracle DataSource for Apache Hadoop (OD4H)

Oracle DataSource for Apache Hadoop (OD4H) allows direct, fast, parallel, secure and consistent access to master data in Oracle Database using Spark SQL via Hive metastore. This chapter discusses Oracle DataSource for Apache Hadoop (OD4H) in the following sections:

9.1 Operational Data, Big Data and Requirements

The common data architecture in most companies nowadays generally comprises of the following components:

Oracle Database(s) for operational, transactional, and master data, that is shared business object such as customers, products, employees and so on
Big Data

Hadoop applications such as Master Data Management (MDM), Events processing, and others, need access to data in both Hadoop storages (such as HDFS and NoSQL Database as a landing point for weblogs, and so on) and Oracle Database (as the reliable and auditable source of truth). There are two approaches to process such data that reside in both Hadoop storage and Oracle Database:

ETL Copy using tools such as Oracle's Copy to BDA
Direct Access using Oracle Big Data SQL and Oracle DataSource for Apache Hadoop (OD4H).

In this chapter, we will discuss Oracle DataSource for Apache Hadoop (OD4H).

9.2 Overview of Oracle DataSource for Apache Hadoop (OD4H)

Oracle DataSource for Apache Hadoop (OD4H) is the storage handler for Oracle Database that uses HCatalog and InputFormat.

This section discusses the following concepts:

9.2.1 Opportunity with Hadoop 2.x

Hadoop 2.x architecture decouples compute engines from cluster resources management and storages. It enables:

A variety of SQL query engines. For instance, Hive SQL, Spark SQL, Big Data SQL, and so on.
A variety of programmatic compute engines. For instance, MapReduce, Pig, Storm, Solr, Cascading, and so on.
Elastic allocation of compute resources (CPU, memory) through YARN.
A variety of data stores such as HDFS, NoSQL, as well as remote storages through HCatalog, InputFormat, OutputFormat and StorageHandler interfaces.

Oracle DataSource for Apache Hadoop (OD4H) is the storage handler for Oracle Database that uses HCatalog and InputFormat.

Following is an illustration of Hadoop 2.0 Architecture:

Figure 9-1 Hadoop 2.0 Architecture

Description of "Figure 9-1 Hadoop 2.0 Architecture"

9.2.2 Oracle Tables as Hadoop Data Source

OD4H enables current and ad-hoc querying. This makes querying data faster and more secure. You can query data directly and retrieve only the data that you need, when you need it.

OD4H also provides Oracle’s end-to-end security. This includes Identity Management, Column Masking, and Label and Row Security.

OD4H also furnishes direct access for Hadoop and Spark APIs such as Pig, MapReduce and others.

9.2.3 External Tables

External Tables turn Oracle tables into Hadoop and/or Spark datasources. The DDL for declaring External Tables is as follows:

CREATE[TEMPORARY] EXTERNAL TABLE [IF NOT EXISTS]  [db_name.]table_name
[(col_name data_type [COMMENTcol_comment],...)]
[COMMENT table_comment]
STORED BY 'oracle.hcat.osh.OracleStorageHandler' [WITHSERDEPROPERTIES(...)]
[TBLPROPERTIES (property_name=property_value,...)]
 
data_type
|SMALLINT
|INT
|BIGINT
|BOOLEAN
|FLOAT
|DOUBLE
|STRING
|BINARY
|TIMESTAMP
|DECIMAL
|DECIMAL(precision,scale)
|VARCHAR
|CHAR

9.2.3.1 TBLPROPERTIES

Property	Use
oracle.hcat.osh.columns.mapping	Comma separated list to specify mapping between Hive columns and Oracle table columns. All external tables using OracleStorageHandler must define this.
mapreduce.jdbc.url	Connection URL to connect to the database
mapreduce.jdbc.username	Connection user name to connect to the database
mapreduce.jdbc.password	Connection password to connect to the database
mapreduce.jdbc.input.table.name	Oracle table name
mapreduce.jdbc.input conditions	To be used for querying the database. Must be used for query pushdown.
mapreduce.jdbc.input.query	To be used for querying the database. Query should be used only when a subset of the columns is selected.
mapreduce.jdbc.input.orderby	`ORDER BY` clause to be specified for pushing ordering to the database.
oracle.hcat.osh.splitterKind	To be used to specify how OracleStorageHandler must create splits, so that they are a good match for the physical structure of the target table in Oracle Database. The splitter kind applicable could be `SINGLE_SPLITTER`, `PARTITION_SPLITTER`, `ROW_SPLITTER`, `BLOCK_SPLITTER`.
oracle.hcat.osh.rowsPerSplit	Used only when `ROW_SPLITTER` splitterKind is applied on the table. Represents Number of rows per split for `LIMIT_RANGE` splitter. Default is 1000
oracle.hcat.osh.authentication	Authentication method used to connect to Oracle Database. Can be `SIMPLE` (default), `ORACLE_WALLET`, `KERBEROS`
sun.security.krb5.principal	Kerberos principal. Used only when `KERBEROS` authentication is applied.
oracle.hcat.osh.kerb.callback	Callback for Kerberos authentication. Used only when Kerberos authentication is applied.
oracle.hcat.osh.maxSplits	Maximum number of splits for any splitter kind
oracle.hcat.osh.useChunkSplitter	Use chunk based ROW_SPLITTER and BLOCK_SPLITTER that use DBMS_PARALLEL_EXECUTE package to divide table into chunks that will map to hadoop splits.The default value is set to ‘true’.
oracle.hcat.osh.chunkSQL	Used by `CUSTOM_SPLITTER`to create splits. The SQL string should be a SELECT statement that returns range of each chunk must have two columns: `start_id` and `end_id` The columns must be of `ROWID` type.
oracle.hcat.osh.useOracleParallelism	When configured, parallel queries will be executed while fetching rows from Oracle. Default value: ‘`false`’
oracle.hcat.osh.fetchSize	JDBC fetchsize for generated select queries used to fetch rows. Default value: 10 (set by Oracle JDBC Driver)

Note:

In addition to the above, any JDBC connection properties (oracle.jdbc.* and oracle.net.*) can be specified as TBLPROPERTIES. They will be used while establishing connection to Oracle Database using JDBC driver.

Note:

Oracle DataSource for Apache Hadoop (OD4H) works with Oracle View and Oracle Tables.

9.2.3.2 SERDE PROPERTIES

Property Use

Property	Use
oracle.hcat.osh.columns.mapping	All external tables using OracleStorageHandler must define this. Its a comma separated list to specify mapping between hive columns (specified in create table) and oracle table columns. `WITHSERDEPROPERTIES` also enables the external table definition to refer only to select columns in the actual Oracle table. In other words, not all columns from the Oracle table need to be part of the Hive external table. The ordering of oracle columns in the mapping is the same as ordering of hive columns specified in create table.

oracle.hcat.osh.columns.mapping

All external tables using OracleStorageHandler must define this. Its a comma separated list to specify mapping between hive columns (specified in create table) and oracle table columns. WITHSERDEPROPERTIES also enables the external table definition to refer only to select columns in the actual Oracle table. In other words, not all columns from the Oracle table need to be part of the Hive external table. The ordering of oracle columns in the mapping is the same as ordering of hive columns specified in create table.

9.2.4 List of jars in the OD4H package

Oracle DataSource for Apache Hadoop (OD4H) contains the following list of jars.

OD4H consists of the following list of jars.

Table 9-1 List of jars in OD4H

Name of JAR	Use
osh.jar	Contains OracleStorageHandler Implementation
ojdbc7.jar	An OD4H specific JDBC driver (which is optimized with internal calls), used by Spark or Hadoop tasks to connect to the database.
ucp.jar	For creating connection pools in OracleStorageHandler
oraclepki103.jar, osdt_core.jar, osdt_cert.jar, osdt_jce.jar	For Oracle Wallet authentication
orai18n.jar	Oracle Globalization Support
xdb.jar	Oracle XDB jar

9.3 How does OD4H work?

Oracle DataSource for Apache Hadoop (OD4H) does not require creating a new table. You can start working with OD4H using the following steps:

Create a new Oracle table, or, reuse an existing table.
Create the Hive DDL for creating the external table referencing the Oracle Table.
Issue HiveSQL, SparkSQL, or other Spark/Hadoop queries and API calls.

The following sections show how to create a new Oracle Database Table, and a Hive DDL:

9.3.1 Create a new Oracle Database Table or Reuse an Existing Table

Here is an illustration of a partitioned Oracle table that we will use to demo how partition pruning works:

CREATE TABLE EmployeeData ( Emp_ID NUMBER,
    First_Name VARCHAR2(20),
    Last_Name VARCHAR2(20),
    Job_Title VARCHAR2(40),
    Salary NUMBER)
PARTITION BY RANGE (Salary)
 ( PARTITION salary_1 VALUES LESS THAN (60000)
    TABLESPACE tsa
 , PARTITION salary_2 VALUES LESS THAN (70000)
    TABLESPACE tsb
 , PARTITION salary_3 VALUES LESS THAN (80000)
    TABLESPACE tsc
 , PARTITION salary_4 VALUES LESS THAN (90000)
    TABLESPACE tsd
 , PARTITION salary_5 VALUES LESS THAN (100000)
    TABLESPACE tse
 );

Note:

You can use this syntax for table creation, in the following examples listed in this Book.

Issue queries from Hive, Spark, or any other Hadoop models (including joins with local Hive Tables.)

9.3.2 Hive DDL

In this example, we will associate two Hive external tables to the same Oracle table, using two different split patterns:

SIMPLE_SPLITTER
PARTITION_SPLITTER

Note:

It is possible that the external table has fewer columns than the base Oracle table.

Since columns can have different names, use TBLPROPERTY for mapping with the base table.

In the following examples, we are using the following variables:

connection_string = jdbc:oracle:thin:@localhost:1521/<servicename>

oracle_user=od4h

oracle_pwd=od4h

The following command creates a Hive external table with the default split pattern, that is SIMPLE_SPLITTER.

CREATE EXTERNAL TABLE EmployeeDataSimple (
 Emp_ID int,
 First_Name string,
 Last_Name string,
 Job_Title string,
 Salary int
)
STORED BY 'oracle.hcat.osh.OracleStorageHandler'
WITH SERDEPROPERTIES (
     'oracle.hcat.osh.columns.mapping' = 'Emp_ID,First_Name,Last_Name,Job_Title,Salary')
TBLPROPERTIES (
 'mapreduce.jdbc.url' = '${hiveconf:jdbc:oracle:thin:@localhost:1521/<servicename>}',
 'mapreduce.jdbc.username' = '${hiveconf:od4h}',
 'mapreduce.jdbc.password' = '${hiveconf:od4h}',
 'mapreduce.jdbc.input.table.name' = 'EmployeeData'
);

The following example creates a Hive external table using PARTITION_SPLITTER.

DROP TABLE EmployeeDataPartitioned;
CREATE EXTERNAL TABLE EmployeeDataPartitioned (
 Emp_ID int,
 First_Name string,
 Last_Name string,
 Job_Title string,
 Salary int
)
STORED BY 'oracle.hcat.osh.OracleStorageHandler'
WITH SERDEPROPERTIES (
     'oracle.hcat.osh.columns.mapping' = 'Emp_ID,First_Name,Last_Name,Job_Title,Salary')
TBLPROPERTIES (
 'mapreduce.jdbc.url' = '${hiveconf:jdbc:oracle:thin:@localhost:1521/<servicename>}',
 'mapreduce.jdbc.username' = '${hiveconf:od4h}',
 'mapreduce.jdbc.password' = '${hiveconf:od4h}',
 'mapreduce.jdbc.input.table.name' = 'EmployeeData',
 'oracle.hcat.osh.splitterKind' = 'PARTITIONED_TABLE'
);

9.3.3 Creating External Tables in Hive

You can create an external table in Hive in the following way:

DROP TABLE employees;
 
CREATE EXTERNAL TABLE employees (
  EMPLOYEE_ID INT,
  FIRST_NAME  STRING,
  LAST_NAME   STRING,
  SALARY      DOUBLE,
  HIRE_DATE   TIMESTAMP,
  JOB_ID      STRING
 )
 
  STORED BY 'oracle.hcat.osh.OracleStorageHandler'
 
WITH SERDEPROPERTIES (
 'oracle.hcat.osh.columns.mapping' = 'employee_id,first_name,last_name,salary,hire_date,job_id')
 
  TBLPROPERTIES (
    'mapreduce.jdbc.url' = 'jdbc:oracle:thin:@localhost:1521:orcl',
    'mapreduce.jdbc.username' = 'hr',
    'mapreduce.jdbc.password' = 'hr',
    'mapreduce.jdbc.input.table.name' = 'EMPLOYEES'
);

Note:

Ensure that ucp.jar, ojdbc8.jar and osh.jar are present in the Hive CLASSPATH, for using OD4H. This is pre-configured on BDA. .

To learn more about CLASSPATH and other Hive configuration properties, refer the following sources:

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Cli

https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties

9.4 Features of OD4H

The following topics discuss features of OD4H.

9.4.1 Performance And Scalability Features

Following sections discuss the performance and scalability features of OD4H:

HCatalog stores table metadata from Hive DDL. HiveSQl, Spark SQL and others, then use this metadata while submitting queries.

The Oracle table is divided into granules determined by the splitterKind property. These granules are then read into a split by OracleStorageHandler, by submitting generated queries.

OracleStorageHandler will not have to test all possible query types if the query plan determines which splits need to be scanned.

Figure 9-2 OD4H in a Nutshell

Description of "Figure 9-2 OD4H in a Nutshell"

9.4.1.1 Splitters

While executing a query on a Hive external table through OD4H, the underlying Oracle table is dynamically divided into granules, which correspond to splits on the Hadoop side. Each split is processed by a single map task. With the help of the ORACLE_SPLITTER_KIND property, you can specify how the splits are created. This ensures that the splits are a good match for the physical structure of the target table in Oracle Database.

The different kinds of splitters available are:

SINGLE_SPLITTER

Creates one split for the table. Use SINGLE_SPLITTER where a single task is sufficient to process the query against the entire table.

ROW_SPLITTER

Limits the number of rows per Split. The default number of rows is 1000. You can specify number of rows by setting the oracle.hcat.osh.rowsPerSplit property. The default value of oracle.hcat.osh.maxSplits is 1 when ROW_SPLITTER is used. You can increase this value to enable parallel reads.

Based on the values provided in the rowsPerSplit property, OD4H will divide tables into splits. If the number of splits obtained is higher than the maxSplits, then maxSplits property will be used. The rows per split will be divided accordingly.

Note:

oracle.hcat.osh.rowsPerSplitis used only by ROW_SPLITTER and not any other splitter kind.

BLOCK_SPLITTER

Creates splits based on underlying storage of data blocks. With Block Splitter, you can specify the maximum number of splits to be generated. The default value of oracle.hcat.osh.maxSplits is 1, when BLOCK_SPLITTER is used. You can increase this value to enable parallel reads. BLOCK_SPLITTER requires SELECT privilege on the SYS.DBA.EXTENTS table, granted to the schema containing the Oracle target table. In the event that this permission does not exist, OD4H will use SINGLE_SPLITTER.

Note:

The actual number of splits under BLOCK_SPLITTER may be lesser than the value specified in the oracle.hcat.osh.maxSplits property.

Do not use BLOCK_SPLITTER on partitioned tables or Index Organized tables.

Note:

For ROW_SPLITTER and BLOCK_SPLITTER types, use oracle.hcat.osh.useChunkSplitter to specify splitting mechanism. The default property value is true. This enables creating chunks corresponding to splits using the DBMS_PARALLEL_EXECUTE package. When the property value is false, custom SQL is generated for splitting. Since DBMS_PARALLEL_EXECUTE can only be used for tables and not views, if mapreduce.jdbc.input.table.name points to a view and not a table, then oracle.hcat.osh.useChunkSplitter should be set to false.

PARTITION_SPLITTER

Creates one split per partition. PARTITION_SPLITTER is used by default when the table is partitioned. You can override this setting by specifying ROW_SPLITTER in table properties. With PARTITION_SPLITTER, the default value of oracle.hcat.osh.maxSplits table property is 64.

Following is an illustration of ROW_SPLITTER:

DROP TABLE employees;
 
CREATE EXTERNAL TABLE employees (
  EMPLOYEE_ID INT,
  FIRST_NAME  STRING,
  LAST_NAME   STRING,
  SALARY      DOUBLE,
  HIRE_DATE   TIMESTAMP,
  JOB_ID      STRING
 )
 STORED BY 'oracle.hcat.osh.OracleStorageHandler'
 
WITH SERDEPROPERTIES (
 'oracle.hcat.osh.columns.mapping' = 'employee_id,first_name,last_name,salary,hire_date,job_id')

TBLPROPERTIES (
'mapreduce.jdbc.url' = 'jdbc:oracle:thin:@localhost:1521:orcl',     
'mapreduce.jdbc.username' = 'hr',
'mapreduce.jdbc.password' = 'hr',
'mapreduce.jdbc.input.table.name' = 'EMPLOYEES',     
'oracle.hcat.osh.splitterKind' = 'ROW_SPLITTER',     
'oracle.hcat.osh.rowsPerSplit' = '1500' 
);

CUSTOM_SPLITTER

Use CUSTOM_SPLITTER If you want to provide a custom split generation mechanism. You can do this using CUSTOM_SPLITTER through oracle.hcat.osh.splitterKind property and a SELECT statement that emits ROWIDs corresponding to start and end of each split in oracle.hcat.osh.chunkSQL.

9.4.1.2 Choosing a Splitter

SINGLE_SPLITTER is used by default if no splitter is specified in the table properties for Hive external table, and the target Oracle table is not partitioned.

For an unpartitioned table, the default value of oracle.hcat.osh.maxSplits will be 1. For partitioned table, the default value of the same will be 64, and the default splitter will be PARTITION_SPLITTER. The default for maxSplits is set to limit the number of connections to the Oracle server. To increase this limit, you must increase the value of oracle.hcat.osh.maxSplits explicitly in hive table properties.

Use the following guidelines while choosing a splitter kind for a hive external table:

Splitter Kind	Use
`SINGLE_SPLITTER`	When no parallelism is required.
`PARTITION_SPLITTER`	Used by default when target table is partitioned
`BLOCK_SPLITTER`	When Oracle user has `SELECT` privilege on `SYS.DBA_EXTENTS`, and target table is not partitioned.
`ROW_SPLITTER`	When Oracle user does not have `SELECT` privilege on `SYS.DBA_EXTENTS`.
`CUSTOM_SPLITTER`	For fine grain control over generated splits.

9.4.1.3 Predicate Pushdown

Predicate Pushdown is an optimization technique, in which you push predicates (WHERE condition) down to be evaluated by Oracle Database at the time of querying. This minimizes the amount of data fetched from Oracle Database to Hive, while performing a query.

Set the configuration property hive.optimize.ppd to either true or false for enabling Predicate Pushdown. The default value on hive-1.1.0 is set to true. Hence, Predicate Pushdown is always performed, unless you want to disable it.

Note:

OD4H does not push down all possible predicates. It considers only the part of the execution plan pertaining to Oracle table declared as external table. OD4H also rewrites sub-queries for the Oracle SQL engine and each split task. At present conditions involving operators >,=,< in a single condition over a column (e.g. key > 10) or a combination of multiple conditions separated by AND (e.g. key > 10 AND key < 20 AND key !=17) are pushed down.

Another option to reduce the amount of data fetched from the Oracle Database is to specify a condition at the time of table creation, using TBLPROPERTY mapreduce.jdbc.input.conditions. For instance:

mapreduce.jdbc.input.conditions = 'key > 10 OR key = 0'.

This will restrict the rows fetched from Oracle Database whenever any query is performed based on the condition specified. The external table that gets created, is analogous to a view on Oracle Database. This approach is only useful when you want to push down complex predicates that cannot be analyzed and automatically pushed down by OD4H.

Table Level Predicate Pushdown

For Table Level Predicate Pushdown to be enabled, you must specify a condition at the time of table creation, using TBLPROPERTY mapreduce.jdbc.input.conditions.

Following is an illustration:

mapreduce.jdbc.input.conditions = 'key > 10 OR key = 0'.

This will restrict the rows fetched from Oracle Database when any query is performed based on the condition specified. The table created will be analogous to a view on Oracle database.

However, Table Level Predicate Pushdown is ignored when a predicate (aWHERE clause) is specified in the query.

9.4.1.4 Projection Pushdown

Projection Pushdown is an optimization technique that fetches only the required columns from Oracle Database when a query is performed. If you want to fetch all columns during a query (not recommended), you can disable it by setting the hive.io.file.read.all.columns connection property to true. On Hive–1.1.0, this property is false by default.

9.4.1.5 Partition Pruning

If you refer to Employee Data Partition table, the partitions irrelevant to the query are removed from the partition list. This is done by executing an explain plan on the query to obtain the list of partitions and sub-partitions that are relevant to the query.

Table level partition pruning uses table level predicate pushdown, on the other hand partition pruning at the query level uses query level predicate pushdown.

Partition pruning is active when a SELECT query is run, in which the WHERE clause uses the partitioning key. Following is an example of partition pruning:

To query the partition, where salary is in the above range and prune other partitions, perform the following:

Hive External Table:

CREATE EXTERNAL TABLE EmployeeDataPartitioned (
 Emp_ID int,
 First_Name string,
 Last_Name string,
 Job_Title string,
 Salary int
)
STORED BY 'oracle.hcat.osh.OracleStorageHandler'
WITH SERDEPROPERTIES (
     'oracle.hcat.osh.columns.mapping' = 'Emp_ID,First_Name,Last_Name,Job_Title,Salary')
TBLPROPERTIES (
 'mapreduce.jdbc.url' = '${hiveconf:connection_string}',
 'mapreduce.jdbc.username' = '${hiveconf:oracle_user}',
 'mapreduce.jdbc.password' = '${hiveconf:oracle_pwd}',
 'mapreduce.jdbc.input.table.name' = 'EmployeeData',
 'oracle.hcat.osh.oosKind' = 'PARTITIONED_TABLE'
);

The following SELECT statement shows how to query the partition, where salary is between 72000 to 78000, and prunes other partitions:

select * from EmployeeDataPartitioned where salary > 72000 and salary < 78000;

9.4.2 Smart Connection Management

Connection Caching

Each map task runs in its own JVM. Each JVM in turn caches a single connection to the Oracle database that you can reuse within the same query. The Mapper checks the cache before establishing a new connection and caching is not done once the query has completed executing.

Oracle RAC Awareness

JDBC and UCP are aware of various Oracle RAC instances. This can be used to split queries submitted to JDBC. The StorageHandler will depend on listener for load balancing.

Handling Logon Storms

Hadoop allows you to limit the number of mappers attempting to connect to the Database. Hadoop allows you to limit the number of mappers attempting to connect to the Database using oracle.hcat.osh.maxSplits. This parameter controls the degree of concurrency. However, subsequent tasks of the same query are guaranteed to query their table granule as per the System Commit Number (SCN) of the query. This ensures consistency of the result sets.

Database Resident Connection Pooling (DRCP)

It is recommended to configure DRCP for OD4H, and limit the maximum number of concurrent connections to the Oracle Database from OD4H.

Configuring Database Resident Connection Pooling

To configure DRCP, use the following steps:

Login as SYSDBA.
Start the default pool, SYS_DEFAULT_CONNECTION_POOL using DBMS_CONNECTION_POOL.START_POOL with the default settings.

You can use DBMS_CONNECTION_POOL.MINSIZE and DBMS_CONNECTION_POOL.MAXSIZE with the default settings.

Note:

Oracle Database Administrator’s Guide for more information on configuring DRCP.

9.4.3 Security Features

Following are the security features of OD4H:

9.4.3.1 Improved Authentication

OD4H uses Oracle JDBC driver for connecting to Oracle Database. It provides all authentication methods supported by Oracle JDBC. OD4H supports authentication through use of basic authentication (username and password), Oracle Wallet, and Kerberos. You can specify the authentication to be used for a table created in Hive, through the oracle.hcat.osh.authentication table property. This is useful only for strong authentication.

Kerberos
Oracle Wallet
Basic Authentication

Note:

Oracle recommends using strong authentication such as Kerberos.

The various authentication processes are described with examples as follows:

Kerberos

Uses Kerberos credentials of the Hadoop engine process. This principal should have access to the table.

9.5 Using HiveQL with OD4H

HiveQL is a SQL like language provided by Hive. It can be used to query hive external tables created using OD4H.

You can run the Resource Manager web interface in your browser (http://bigdatalite.localdomain:8088/cluster), to track the status of a running query on BDA.

You can also see the logs of a query in Cloudera Manager, which also indicates the actual query sent to Oracle Database corresponding to your query on HiveQL. Hive and OD4H use slf4j framework for logging. You can control logging level for OD4H related classes using logging configuration techniques of Hive.

9.6 Using Spark SQL with OD4H

Spark SQL enables relational queries expressed in SQL and HiveSQL to be executed using Spark. Spark SQL allows you to mix SQL queries with programmatic data manipulations supported by RDDs (Resilient Distributed Datasets) in Java, Python and Scala, with a single application.

Spark SQL enables you to submit relational queries using SQL or HiveQL. You can also use it to query external tables created using OD4H.

Perform the following steps to configure Spark-SQL on BigDataLite-4.2 VM, before running queries:

Add ojdbc7.jar and osh.jar to CLASSPATH in /usr/lib/spark/bin/compute-classpath.sh

CLASSPATH="$CLASSPATH:/opt/oracle/od4h/lib/osh.jar"
CLASSPATH="$CLASSPATH:/opt/oracle/od4h/lib/ojdbc7.jar"

Edit SPARK_HOME in /usr/lib/spark/conf/spark-env.sh

export SPARK_HOME=/usr/lib/spark:/etc/hive/conf

You will need to specify additional environment variables in /usr/lib/spark/conf/spark-env.sh.

The Hive related variables that need to be added are marked in bold. The file already contains Hadoop related environment variables.

export DEFAULT_HADOOP=/usr/lib/hadoop
export DEFAULT_HIVE=/usr/lib/hive
export DEFAULT_HADOOP_CONF=/etc/hadoop/conf
export DEFAULT_HIVE_CONF=/etc/hive/conf
export HADOOP_HOME=${HADOOP_HOME:-$DEFAULT_HADOOP}
export HADOOP_HDFS_HOME=${HADOOP_HDFS_HOME:-${HADOOP_HOME}/../hadoop-hdfs}
export HADOOP_MAPRED_HOME=${HADOOP_MAPRED_HOME:-${HADOOP_HOME}/../hadoop-mapreduce}
export HADOOP_YARN_HOME=${HADOOP_YARN_HOME:-${HADOOP_HOME}/../hadoop-yarn}
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-$DEFAULT_HADOOP_CONF}
export HIVE_CONF_DIR=${HIVE_CONF_DIR:-$DEFAULT_HIVE_CONF}
 
CLASSPATH="$CLASSPATH:$HIVE_CONF_DIR"
CLASSPATH="$CLASSPATH:$HADOOP_CONF_DIR"
 
if [ "x" != "x$YARN_CONF_DIR" ]; then
  CLASSPATH="$CLASSPATH:$YARN_CONF_DIR"
fi
 
# Let's make sure that all needed hadoop libs are added properly
CLASSPATH="$CLASSPATH:$HADOOP_HOME/client/*"
CLASSPATH="$CLASSPATH:$HIVE_HOME/lib/*"
CLASSPATH="$CLASSPATH:$($HADOOP_HOME/bin/hadoop classpath)"

Once configured, you can run some sample queries on spark SQL using scripts included in demo:/shell/*QuerySpark.sh. By default, Spark prints queries on the console. To modify this behavior you can edit the spark logging configuration file /usr/lib/spark/conf/log4j.properties.

The log printed by OracleRecordReader shows the actual query sent to Oracle Database, as follows:

15/03/18 10:36:08 INFO OracleRecordReader: Reading records from Oracle Table using Query: SELECT FIRST_NAME, LAST_NAME, EMP_ID FROM EmployeeData

9.7 Writing Back to Oracle Database

In the typical use case for OD4H, you store the result sets of Hive or Spark SQL queries back to Oracle Database. OD4H implements OutputFormat to enable you to write back to an Oracle Database table from Hadoop.

After the data is inserted into an Oracle Database table, you can then use your favorite business intelligence tools for further data mining

The following query is from the OD4H demo code samples. It demonstrates writing back to an external table called EmployeeBonusReport.

Example 9-1 Writing Hive or Spark Result Sets Back to Oracle Database

INSERT INTO EmployeeBonusReport 
             SELECT EmployeeDataSimple.First_Name, EmployeeDataSimple.Last_Name,
                    EmployeeBonus.bonus 
             FROM EmployeeDataSimple JOIN EmployeeBonus ON
                        (EmployeeDataSimple.Emp_ID=EmployeeBonus.Emp_ID)
             WHERE salary > 70000 and bonus > 7000"

Oracle Data Type	Hive Data Type
`NUMBER`	INT when the scale is 0 and the precision is less than 10. BIGNIT when the scale is 0 and precision is less than 19. DECIMAL when the scale is greater than 0 or the precision is greater than 19.
`CLOB` `NCLOB`	`STRING`
`BINARY_DOUBLE`	`DOUBLE`
`BINARY_FLOAT`	`FLOAT`
`BLOB`	`BINARY`
`CHAR` `NCHAR`	`CHAR`
`VARCHAR2` `NVARCHAR2`	`VARCHAR`
`ROWID` `UROWID`	`BINARY`
`DATE`	`TIMESTAMP`
`TIMESTAMP`	`TIMESTAMP`
`TIMESTAMPTZ` `TIMESTAMPLTZ`	Unsupported
`RAW`	`BINARY`