9 Oracle Table Access for Hadoop and Spark (OTA4H)

Oracle Table Access for Hadoop and Spark (OTA4H) is an Oracle Big Data Appliance feature that converts Oracle tables to Hadoop and Spark datasources. OTA4H allows direct, fast, parallel, secure and consistent access to master data in Oracle database using Hive SQL, Spark SQL, as well as Hadoop and Spark APIs that support SerDes, HCatalog, InputFormat and StorageHandler. This chapter discusses Oracle Table Access for Hadoop and Spark (OTA4H) in the following sections:

9.1 Operational Data, Big Data and Requirements

The common data architecture in most companies nowadays generally comprises of the following components:

  • Oracle Database(s) for operational, transactional, and master data, that is shared business object such as customers, products, employees and so on

  • Big Data

Hadoop applications such as Master Data Management (MDM), Events processing, and others, need access to data in both Hadoop storages (such as HDFS and NoSQL Database as a landing point for weblogs, and so on) and Oracle Database (as the reliable and auditable source of truth). There are two approaches to process such data that reside in both Hadoop storage and Oracle Database:

  • ETL Copy using tools such as Oracle's Copy2BDA

  • Direct Access using Oracle Big Data SQL and Oracle Table Access for Hadoop and Spark (OTA4H).

In this chapter, we will discuss Oracle Table Access for Hadoop and Spark (OTA4H).

9.2 Overview of Oracle Table Access for Hadoop and Spark (OTA4H)

Oracle Table Access for Hadoop and Spark (OTA4H) is the storage handler for Oracle Database that uses HCatalog and InputFormat.

This section discusses the following concepts:

9.2.1 Opportunity with Hadoop 2.x

Hadoop 2.x architecture decouples compute engines from cluster resources management and storages. It enables:

  • A variety of SQL query engines. For instance, HIve SQL, Spark SQL, Big Data SQL, and so on.

  • A variety of programmatic compute engines. For instance, MapReduce, Pig, Storm, Solr, Cascading, and so on.

  • Elastic allocation of compute resources (CPU, memory) through YARN.

  • A variety of data stores such as HDFS, NoSQL, as well as remote storages through HCatalog, InputFormat, OutputFormat and StorageHandler interfaces.

OTA4H is the storage handler for Oracle Database that uses HCatalog and InputFormat.

Following is an illustration of Hadoop 2.0 Architecture:

Figure 9-1 Hadoop 2.0 Architecture

This figure shows the architecture of Hadoop 2.0

9.2.2 Oracle Tables as Hadoop Data Source

OTA4H enables current and ad-hoc querying. This makes querying data faster and more secure. You can query data directly and retrieve only the data that you need, when you need it.

OTA4H also provides Oracle’s end-to-end security. This includes Identity Management, Column Masking, and Label and Row Security.

OTA4H also furnishes direct access for Hadoop and Spark APIs such as Pig, MapReduce and others.

9.2.3 External Tables

External Tables turn Oracle tables into Hadoop and/or Spark datasources. The DDL for declaring External Tables is as follows:

[(col_name data_type [COMMENTcol_comment],...)]
[COMMENT table_comment]
STORED BY 'oracle.hcat.osh.OracleStorageHandler' [WITHSERDEPROPERTIES(...)]
[TBLPROPERTIES (property_name=property_value,...)]

See Also:

Refer the following link for Hive External Table syntax https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTable


Oracle supports only primitive types.

The properties of external tables can be described as follows: TBLPROPERTIES

Property Use


Comma separated list to specify mapping between Hive columns and Oracle table columns. All external tables using OracleStorageHandler must define this.


Connection URL to connect to the database


Connection user name to connect to the database


Connection password to connect to the database


Oracle table name

mapreduce.jdbc.input conditions

To be used for querying the database. Must be used for query pushdown.


To be used for querying the database. Query should be used only when a subset of the columns is selected.


ORDER BY clause to be specified for pushing ordering to the database.


To be used to specify how OracleStorageHandler must create splits, so that they are a good match for the physical structure of the target table in Oracle Database. The splitter kind applicable could be SINGLE_SPLITTER, PARTITION_SPLITTER, ROW_SPLITTER, BLOCK_SPLITTER.


Used only when ROW_SPLITTER splitterKind is applied on the table. Represents Number of rows per split for LIMIT_RANGE splitter. Default is 1000


Authentication method used to connect to Oracle Database. Can be SIMPLE (default), ORACLE_WALLET, KERBEROS


Kerberos principal. Used only when KERBEROS authentication is applied.


Callback for Kerberos authentication. Used only when Kerberos authentication is applied.


Maximum number of splits for any splitter kind


In addition to the above, any JDBC connection properties (oracle.jdbc.* and oracle.net.*) can be specified as TBLPROPERTIES. They will be used while establishing connection to Oracle Database using JDBC driver.


Oracle Table Access for Hadoop (OTA4H) works with Oracle View and Oracle Tables. SERDE PROPERTIES

Property Use


All external tables using OracleStorageHandler must define this. Its a comma separated list to specify mapping between hive columns (specified in create table) and oracle table columns. WITHSERDEPROPERTIES also enables the external table definition to refer only to select columns in the actual Oracle table. In other words, not all columns from the Oracle table need to be part of the Hive external table. The ordering of oracle columns in the mapping is the same as ordering of hive columns specified in create table.

9.2.4 List of jars in the OTA4H package

OTA4H consists of the following list of jars. All jars are located under /opt/oracle/ota4h/jlib in BDA.

List of jars in OTA4H

Name of jars Use
osh.jar Contains OracleStorageHandler Implementation
ojdbc7.jar An OTA4H specific JDBC driver (which is optimized with internal calls), used by Spark or Hadoop tasks to connect to the database.
ucp.jar For creating connection pools in OracleStorageHandler
oraclepki103.jar, osdt_core.jar, osdt_cert.jar, osdt_jce.jar For Oracle Wallet authentication
orai18n.jar Oracle Globalization Support

9.2.5 Creating External Tables in Hive

You can create an external table in Hive in the following way:

DROP TABLE employees;
  STORED BY 'oracle.hcat.osh.OracleStorageHandler'
 'oracle.hcat.osh.columns.mapping' = 'employee_id,first_name,last_name,salary,hire_date,job_id')
    'mapreduce.jdbc.url' = 'jdbc:oracle:thin:@localhost:1521:orcl',
    'mapreduce.jdbc.username' = 'hr',
    'mapreduce.jdbc.password' = 'hr',
    'mapreduce.jdbc.input.table.name' = 'EMPLOYEES'


Include ucp.jar, ojdbc7.jar and osh.jar in the Hive auxpath controlled by HIVE_AUX_JARS_PATH environment variable that is present in hive-env.sh, hive.aux.jars.path configuration property or --auxpath option when you invoke Hive. On BDA, you can configure these using Cloudera Manager interface. You should also add these jars to classpath of hadoop tasks using add jar on Hive command line.

For various Hive Command Line options and configuration properties, refer the following sources:



9.3 How does OTA4H work?

Oracle Table Access for Hadoop and Spark (OTA4H) does not require creating a new table. You can start working with OTA4H using the following steps:

  1. Create the Hive DDL for creating the external table referencing the Oracle Table.

  2. Issue HiveSQL, SparkSQL, or other Spark/Hadoop queries and API calls.

The following sections show how to create a new Oracle Database Table, and a Hive DDL.

Create a New Oracle Database Table

Hive DDL

9.3.1 Create a new Oracle Database Table

Here is an illustration of a partitioned Oracle table that we will use to demo how partition pruning works:

  1. CREATE TABLE EmployeeData ( Emp_ID NUMBER,
        First_Name VARCHAR2(20),
        Last_Name VARCHAR2(20),
        Job_Title VARCHAR2(40),
        Salary NUMBER)
     ( PARTITION salary_1 VALUES LESS THAN (60000)
        TABLESPACE tsa
     , PARTITION salary_2 VALUES LESS THAN (70000)
        TABLESPACE tsb
     , PARTITION salary_3 VALUES LESS THAN (80000)
        TABLESPACE tsc
     , PARTITION salary_4 VALUES LESS THAN (90000)
        TABLESPACE tsd
     , PARTITION salary_5 VALUES LESS THAN (100000)
        TABLESPACE tse


    You can use this syntax for table creation, in the following examples listed in this Book.

  2. Issue queries from Hive, Spark, or any other Hadoop models (including joins with local Hive Tables.)

9.3.2 Hive DDL

In this example, we will associate two Hive external tables to the same Oracle table, using two different split patterns:




It is possible that the external table has fewer columns than the base Oracle table.

Since columns can have different names, use TBLPROPERTY for mapping with the base table.

In the following examples, we are using the following variables:

connection_string = jdbc:oracle:thin:@localhost:1521/<servicename>



The following command creates a Hive external table with the default split pattern, that is SIMPLE_SPLITTER.

 Emp_ID int,
 First_Name string,
 Last_Name string,
 Job_Title string,
 Salary int
STORED BY 'oracle.hcat.osh.OracleStorageHandler'
     'oracle.hcat.osh.columns.mapping' = 'Emp_ID,First_Name,Last_Name,Job_Title,Salary')
 'mapreduce.jdbc.url' = '${hiveconf:jdbc:oracle:thin:@localhost:1521/<servicename>}',
 'mapreduce.jdbc.username' = '${hiveconf:ota4h}',
 'mapreduce.jdbc.password' = '${hiveconf:ota4h}',
 'mapreduce.jdbc.input.table.name' = 'EmployeeData'

The following example creates a Hive external table using PARTITION_SPLITTER.

DROP TABLE EmployeeDataPartitioned;
CREATE EXTERNAL TABLE EmployeeDataPartitioned (
 Emp_ID int,
 First_Name string,
 Last_Name string,
 Job_Title string,
 Salary int
STORED BY 'oracle.hcat.osh.OracleStorageHandler'
     'oracle.hcat.osh.columns.mapping' = 'Emp_ID,First_Name,Last_Name,Job_Title,Salary')
 'mapreduce.jdbc.url' = '${hiveconf:jdbc:oracle:thin:@localhost:1521/<servicename>}',
 'mapreduce.jdbc.username' = '${hiveconf:ota4h}',
 'mapreduce.jdbc.password' = '${hiveconf:ota4h}',
 'mapreduce.jdbc.input.table.name' = 'EmployeeData',
 'oracle.hcat.osh.splitterKind' = 'PARTITIONED_TABLE'

See Also:

Big Data Appliance page for demo code samples

9.4 Features of OTA4H

The following topics discuss features of OTA4H.

9.4.1 Performance And Scalability Features

Following sections discuss the performance and scalability features of OTA4H:

HCatalog stores table metadata from Hive DDL. HiveSQl, Spark SQL and others, then use this metadata while submitting queries.

The Oracle table is divided into granules determined by the splitterKind property. These granules are then read into a split by OracleStorageHandler, by submitting generated queries.

OracleStorageHandler will not have to test all possible query types if the query plan determines which splits need to be scanned.

Figure 9-2 OTA4H in a Nutshell

Table Access for Hadoop in a nutshell Splitters

While executing a query on a Hive external table through OTA4H, the underlying Oracle table is dynamically divided into granules, which correspond to splits on the Hadoop side. Each split is processed by a single map task. With the help of the ORACLE_SPLITTER_KIND property, you can specify how the splits are created. This ensures that the splits are a good match for the physical structure of the target table in Oracle Database.

The different kinds of splitters available are:


Creates one split for the table. Use SINGLE_SPLITTERwhere a single task is sufficient to process the query against the entire table.


Limits the number of rows per Split. The default number of rows is 1000. You can specify number of rows by setting the oracle.hcat.osh.rowsPerSplit property. The default value of oracle.hcat.osh.maxSplits is 1 when ROW_SPLITTER is used. You can increase this value to enable parallel reads.

Based on the values provided in the rowsPerSplit property, OTA4H will divide tables into splits. If the number of splits obtained is higher than the maxSplits, then maxSplits property will be used. The rows per split will be divided accordingly.


oracle.hcat.osh.rowsPerSplitis used only by ROW_SPLITTER and not any other splitter kind.


Creates splits based on underlying storage of data blocks. With Block Splitter, you can specify the maximum number of splits to be generated. The default value of oracle.hcat.osh.maxSplits is 1, when BLOCK_SPLITTER is used. You can increase this value to enable parallel reads. BLOCK_SPLITTER requires SELECT privilege on the SYS.DBA.EXTENTS table, granted to the schema containing the Oracle target table. In the event that this permission does not exist, OTA4H will use SINGLE_SPLITTER.


The actual number of splits under BLOCK_SPLITTER may be lesser than the value specified in the oracle.hcat.osh.maxSplits property.

Do not use BLOCK_SPLITTER on partitioned tables or Index Organized tables.


Creates one split per partition. PARTITION_SPLITTER is used by default when the table is partitioned. You can override this setting by specifying ROW_SPLITTER in table properties. With PARTITION_SPLITTER, the default value of oracle.hcat.osh.maxSplits table property is 64.

Following is an illustration of ROW_SPLITTER:

DROP TABLE employees;
 STORED BY 'oracle.hcat.osh.OracleStorageHandler'
 'oracle.hcat.osh.columns.mapping' = 'employee_id,first_name,last_name,salary,hire_date,job_id')

'mapreduce.jdbc.url' = 'jdbc:oracle:thin:@localhost:1521:orcl',     
'mapreduce.jdbc.username' = 'hr',
'mapreduce.jdbc.password' = 'hr',
'mapreduce.jdbc.input.table.name' = 'EMPLOYEES',     
'oracle.hcat.osh.splitterKind' = 'ROW_SPLITTER',     
'oracle.hcat.osh.rowsPerSplit' = '1500' 
); Choosing a Splitter

SINGLE_SPLITTER is used by default if no splitter is specified in the table properties for Hive external table, and the target Oracle table is not partitioned.

For an unpartitioned table, the default value of oracle.hcat.osh.maxSplits will be 1. For partitioned table, the default value of the same will be 64, and the default splitter will be PARTITION_SPLITTER. The default for maxSplits is set to limit the number of connections to the Oracle server. To increase this limit, you must increase the value of oracle.hcat.osh.maxSplits explicitly in hive table properties.

Use the following guidelines while choosing a splitter kind for a hive external table:

Splitter Kind Use


When no parallelism is required.


Used by default when target table is partitioned


When Oracle user has SELECT privilege on SYS.DBA_EXTENTS, and target table is not partitioned.


When Oracle user does not have SELECT privilege on SYS.DBA_EXTENTS. Predicate Pushdown

Predicate Pushdown is an optimization technique, in which you push predicates (WHERE condition) down to be evaluated by Oracle Database at the time of querying. This minimizes the amount of data fetched from Oracle Database to Hive, while performing a query.

Set the configuration property hive.optimize.ppd to either true or false for enabling Predicate Pushdown. The default value on hive-1.1.0 is set to true. Hence, Predicate Pushdown is always performed, unless you want to disable it.


OTA4H does not push down all possible predicates. It considers only the part of the execution plan pertaining to Oracle table declared as external table. OTA4H also rewrites sub-queries for the Oracle SQL engine and each split task. At present conditions involving operators >,=,< in a single condition over a column (e.g. key > 10) or a combination of multiple conditions separated by AND (e.g. key > 10 AND key < 20 AND key !=17) are pushed down.

Another option to reduce the amount of data fetched from the Oracle Database is to specify a condition at the time of table creation, using TBLPROPERTY mapreduce.jdbc.input.conditions. For instance:

mapreduce.jdbc.input.conditions = 'key > 10 OR key = 0'.

This will restrict the rows fetched from Oracle Database whenever any query is performed based on the condition specified. The external table that gets created, is analogous to a view on Oracle Database. This approach is only useful when you want to push down complex predicates that cannot be analyzed and automatically pushed down by OTA4H.

Table Level Predicate Pushdown

For Table Level Predicate Pushdown to be enabled, you must specify a condition at the time of table creation, using TBLPROPERTY mapreduce.jdbc.input.conditions.

Following is an illustration:

mapreduce.jdbc.input.conditions = 'key > 10 OR key = 0'.

This will restrict the rows fetched from Oracle Database when any query is performed based on the condition specified. The table created will be analogous to a view on Oracle database.

However, Table Level Predicate Pushdown is ignored when a predicate (aWHERE clause) is specified in the query. Projection Pushdown

Projection Pushdown is an optimization technique that fetches only the required columns from Oracle Database when a query is performed. If you want to fetch all columns during a query (not recommended), you can disable it by setting the hive.io.file.read.all.columns connection property to true. On Hive–1.1.0, this property is false by default. Partition Pruning

If you refer to Employee Data Partition table, the partitions irrelevant to the query are removed from the partition list. This is done by executing an explain plan on the query to obtain the list of partitions and sub-partitions that are relevant to the query.

Table level partition pruning uses table level predicate pushdown, on the other hand partition pruning at the query level uses query level predicate pushdown.

Partition pruning is active when a SELECT query is run, in which the WHERE clause uses the partitioning key. Following is an example of partition pruning:

To query the partition, where salary is in the above range and prune other partitions, perform the following:

Hive External Table:

CREATE EXTERNAL TABLE EmployeeDataPartitioned (
 Emp_ID int,
 First_Name string,
 Last_Name string,
 Job_Title string,
 Salary int
STORED BY 'oracle.hcat.osh.OracleStorageHandler'
     'oracle.hcat.osh.columns.mapping' = 'Emp_ID,First_Name,Last_Name,Job_Title,Salary')
 'mapreduce.jdbc.url' = '${hiveconf:connection_string}',
 'mapreduce.jdbc.username' = '${hiveconf:oracle_user}',
 'mapreduce.jdbc.password' = '${hiveconf:oracle_pwd}',
 'mapreduce.jdbc.input.table.name' = 'EmployeeData',
 'oracle.hcat.osh.oosKind' = 'PARTITIONED_TABLE'

The following SELECT statement shows how to query the partition, where salary is between 72000 to 78000, and prunes other partitions:

select * from EmployeeDataPartitioned where salary > 72000 and salary < 78000;

9.4.2 Smart Connection Management

Connection Caching

Each map task runs in its own JVM. Each JVM in turn caches a single connection to the Oracle database that you can reuse within the same query. The Mapper checks the cache before establishing a new connection and caching is not done once the query has completed executing.

Oracle RAC Awareness

JDBC and UCP are aware of various Oracle RAC instances. This can be used to split queries submitted to JDBC. The StorageHandler will depend on listener for load balancing.


You can spread queries across multiple instances in replicated databases to achieve parallelization.

Handling Logon Storms

Hadoop allows you to limit the number of mappers attempting to connect to the Database. Hadoop allows you to limit the number of mappers attempting to connect to the Database using oracle.hcat.osh.maxSplits. This parameter controls the degree of concurrency. However, subsequent tasks of the same query are guaranteed to query their table granule as per the System Commit Number (SCN) of the query. This ensures consistency of the result sets.

Database Resident Connection Pooling (DRCP)

It is recommended to configure DRCP for OTA4H, and limit the maximum number of connections you would like to dedicate.

Configuring Database Resident Connection Pooling

To configure DRCP, use the following steps:

  1. Login as SYSDBA.

  2. Start the default pool, SYS_DEFAULT_CONNECTION_POOL using DBMS_CONNECTION_POOL.START_POOL with the default settings.

    You can use DBMS_CONNECTION_POOL.MINSIZE and DBMS_CONNECTION_POOL.MAXSIZE with the default settings.


Oracle Database Administrator’s Guide for more information on Configuring DRCP.

9.4.3 Security Features

Following are the security features of OTA4H: Improved Authentication

OTA4H uses Oracle JDBC driver for connecting to Oracle Database. It provides all authentication methods supported by Oracle JDBC. OTA4H supports authentication through use of basic authentication (username and password), Oracle Wallet, and Kerberos. You can specify the authentication to be used for a table created in Hive, through the oracle.hcat.osh.authentication table property. This is useful only for strong authentication.

  • Kerberos

  • Oracle Wallet

  • Basic Authentication


Oracle recommends using strong authentication such as Kerberos.

The various authentication processes are described with examples as follows:

  1. Kerberos

    Uses Kerberos credentials of the Hadoop engine process. This principal should have access to the table.

    See Also:

    Oracle Database JDBC Developer's Guide for information on configuring database for Kerberos and details of client parameters

    You can enable Kerberos configuration on Hive, by adding to hive-env.sh the following:

    export HADOOP_OPTS="$HADOOP_OPTS -Djava.security.krb5.conf=<path to kerberos configuration>

    To enable child JVMs to use Kerberos configuration, edit the mapred-site.xml to include the following property on all nodes of the cluster:

    <property><name>mapred.child.java.opts</name>  <value>-Djava.security.krb5.conf=<path to kerberos configuration>></value></property>

    Enable these configurations on BDA using Cloudera manager..

    Following is an illustration of Kerberos authentication:

    CREATE EXTERNAL TABLE kerb_example (
    id DECIMAL,
    name STRING,
    salary DECIMAL
    STORED BY 'oracle.hcat.osh.OracleStorageHandler'
                    'oracle.hcat.osh.columns.mapping' = 'id,name,salary')
    'mapreduce.jdbc.url' = 'jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)
    'mapreduce.jdbc.input.table.name' = 'kerb_example',
    'mapreduce.jdbc.username' = 'CLIENT@xxxxxx.COM',
    'oracle.hcat.osh.authentication' = 'KERBEROS',
    'oracle.net.kerberos5_cc_name' = '/tmp/krb5cc_xxxxx',
    'java.security.krb5.conf' = '/home/user/kerberos/krb5.conf',
    'oracle.hcat.osh.kerb.callback' = 'KrbCallbackHandler',
    'sun.security.krb5.principal' = 'CLIENT@xxxxx.COM'

    The path specified in oracle.security.krb5.conf should be accessible to all nodes of the cluster. These paths should also match with the path of the corresponding properties in Oracle Database sqlnet.ora.The keytab path provided in sqlnet.ora should also be accessible from all nodes of the cluster.

    If sun.security.krb5.principal is not specified, OTA4H will attempt to authenticate using default principal in Credential Cache specified by the oracle.net.kerberos5_cc_name property.


    The callback will be called only if the principal cannot be authenticated using a ticket obtained from the credential cache specified in oracle.net.kerberos5_cc_nameproperty.

    A simple callback handler class is described as follows (The callback class must be available to the hive classpath):

    class KrbCallbackHandler 
                    implements CallbackHandler{
    public void handle(Callback[] callbacks) throws IOException,
    for (int i = 0; i < callbacks.length; i++){
            if (callbacks[i] instanceof PasswordCallback){
            PasswordCallback pc = (PasswordCallback)callbacks[i];
            System.out.println("set password to 'welcome'");
            pc.setPassword((new String("welcome")).toCharArray());
    } else if (callbacks[i] instanceof NameCallback) {
            throw new UnsupportedCallbackException(callbacks[i],
                            "Unrecognized Callback");
  2. Oracle Wallet

    The wallet should be available in the OS environment of each engine process. Following is an illustration of how to add Wallet authentication:

    CREATE EXTERNAL TABLE wallet_example (
            id DECIMAL,
            name STRING,
            salary DECIMAL
    STORED BY 'oracle.hcat.osh.OracleStorageHandler'
                    'oracle.hcat.osh.columns.mapping' = 'id,name,salary')
    'mapreduce.jdbc.url' = 'jdbc:oracle:thin:/@inst1',
    'mapreduce.jdbc.input.table.name' = 'wallet_example',
    'oracle.hcat.osh.authentication' = 'ORACLE_WALLET',
    'oracle.net.tns_admin' = '/scratch/user/view_storage/user_project6/work',
    'oracle.net.wallet_location' = '/scratch/user/view_storage/user_project6/work'


    The paths specified in oracle.net.tns_admin and oracle.net.wallet_location should be accessible from all nodes of the cluster.

    See Also:

    Managing the Secure External Password Store for Password Credentials section in the Oracle Database Security Guide.

  3. Basic Authentication (for demo purposes only)

    This is stored in HCatalog TBLPROPERTIES or supplied on HiveQL SELECT statement.

    When Basic Authentication is used, the username and password for Oracle Schema is specified in Hive external Table properties.


    Oracle does not recommend this in the production environment, since the password is stored in clear in HCatalog.

9.5 Using HiveQL with OTA4H

HiveQL is a SQL like language provided by Hive. It can be used to query hive external tables created using OTA4H.

You can run the Resource Manager web interface in your browser (http://bigdatalite.localdomain:8088/cluster), to track the status of a running query on BDA.

You can also see the logs of a query in Cloudera Manager, which also indicates the actual query sent to Oracle Database corresponding to your query on HiveQL. Hive and OTA4H use slf4j framework for logging. You can control logging level for OTA4H related classes using logging configuration techniques of Hive.

9.6 Using Spark SQL with OTA4H

Spark is a cluster programming platform designed to be fast, and general purpose. Spark enables users to load data, and to run queries, from memory. Spark can handle a diverse range of workloads in the same engine including interactive queries, batch applications etc.

Spark SQL enables relational queries expressed in SQL and HiveSQL to be executed using Spark. Spark SQL allows you to mix SQL queries with programmatic data manipulations supported by RDDs (Resilient Distributed Datasets) in Java, Python and Scala, with a single application.

Spark SQL enables you to submit relational queries using SQL or HiveQL. You can also use it to query external tables created using OTA4H.

Perform the following steps to configure Spark-SQL on BigDataLite-4.2 VM, before running queries:

  1. Add ojdbc7.jar and osh.jar to CLASSPATH in /usr/lib/spark/bin/compute-classpath.sh

  2. Edit SPARK_HOME in /usr/lib/spark/conf/spark-env.sh

    export SPARK_HOME=/usr/lib/spark:/etc/hive/conf
  3. You will need to specify additional environment variables in /usr/lib/spark/conf/spark-env.sh.

    The Hive related variables that need to be added are marked in bold. The file already contains Hadoop related environment variables.

    export DEFAULT_HADOOP=/usr/lib/hadoop
    export DEFAULT_HIVE=/usr/lib/hive
    export DEFAULT_HADOOP_CONF=/etc/hadoop/conf
    export DEFAULT_HIVE_CONF=/etc/hive/conf
    export HADOOP_HDFS_HOME=${HADOOP_HDFS_HOME:-${HADOOP_HOME}/../hadoop-hdfs}
    export HADOOP_MAPRED_HOME=${HADOOP_MAPRED_HOME:-${HADOOP_HOME}/../hadoop-mapreduce}
    export HADOOP_YARN_HOME=${HADOOP_YARN_HOME:-${HADOOP_HOME}/../hadoop-yarn}
    if [ "x" != "x$YARN_CONF_DIR" ]; then
    # Let's make sure that all needed hadoop libs are added properly
    CLASSPATH="$CLASSPATH:$($HADOOP_HOME/bin/hadoop classpath)"

Once configured, you can run some sample queries on spark SQL using scripts included in demo:/shell/*QuerySpark.sh. By default, Spark prints queries on the console. To modify this behavior you can edit the spark logging configuration file /usr/lib/spark/conf/log4j.properties.

The log printed by OracleRecordReader shows the actual query sent to Oracle Database, as follows:

15/03/18 10:36:08 INFO OracleRecordReader: Reading records from Oracle Table using Query: SELECT FIRST_NAME, LAST_NAME, EMP_ID FROM EmployeeData