6 Oracle DataSource for Apache Hadoop (OD4H)
6.1 Operational Data, Big Data and Requirements
The typical data architecture in most companies comprises the following components:
- Oracle Database for operational, transactional, and master data, which is the shared business object
- Big Data
Hadoop applications (such as Master Data Management (MDM) or event processing) need access to data in both Hadoop stores (such as HDFS and NoSQL Database, as a landing point for web logs) and Oracle Database (as the reliable and auditable source of truth). You can adopt one of the following approaches to process data that resides in both Hadoop storage and Oracle Database:
- Perform an Extract, Transform, and Load (ETL) copy using tools such as the Copy to Hadoop tool from Oracle
- Access the data directly using Oracle Big Data SQL and Oracle DataSource for Apache Hadoop (OD4H)
This chapter discusses Oracle DataSource for Apache Hadoop (OD4H).
6.2 Overview of Oracle DataSource for Apache Hadoop (OD4H)
Oracle DataSource for Apache Hadoop (OD4H) is the storage handler for Oracle Database that uses HCatalog and InputFormat.
This section discusses the following concepts:
6.2.1 Opportunity with Hadoop 2.x
The Hadoop 2.x architecture decouples compute engines from cluster resource management and storage. It enables:
- A variety of SQL query engines, such as Hive SQL, Spark SQL, and Big Data SQL.
- A variety of programmatic compute engines, such as MapReduce, Pig, Storm, Solr, and Cascading.
- Elastic allocation of compute resources (CPU, memory) through YARN.
- A variety of data stores such as HDFS and NoSQL, as well as remote storage, through the HCatalog, InputFormat, OutputFormat, and StorageHandler interfaces.
Oracle DataSource for Apache Hadoop (OD4H) is the storage handler for Oracle Database that uses HCatalog and InputFormat.
Figure: Hadoop 2.x architecture.
6.2.2 Oracle Tables as Hadoop Data Source
OD4H enables current and ad hoc querying. This makes querying data faster and more secure: you query the data directly and retrieve only the data that you need, when you need it.
OD4H also provides Oracle's end-to-end security, including Identity Management, Column Masking, and Label and Row Security.
OD4H also provides direct access to Hadoop and Spark APIs such as Pig, MapReduce, and others.
6.2.3 External Tables
External Tables turn Oracle tables into Hadoop and/or Spark datasources. The DDL for declaring External Tables is as follows:
CREATE [TEMPORARY] EXTERNAL TABLE [IF NOT EXISTS] [db_name.]table_name
  [(col_name data_type [COMMENT col_comment], ...)]
  [COMMENT table_comment]
  STORED BY 'oracle.hcat.osh.OracleStorageHandler' [WITH SERDEPROPERTIES (...)]
  [TBLPROPERTIES (property_name=property_value, ...)]
data_type
  : SMALLINT
  | INT
  | BIGINT
  | BOOLEAN
  | FLOAT
  | DOUBLE
  | STRING
  | BINARY
  | TIMESTAMP
  | DECIMAL
  | DECIMAL(precision,scale)
  | VARCHAR
  | CHAR

See Also:
Refer to the following link for Hive external table syntax: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTable

Note:
Oracle supports only primitive types.
| Oracle Data Type | Hive Data Type |
|---|---|
| NUMBER | INT when the scale is 0 and the precision is less than 10. BIGINT when the scale is 0 and the precision is less than 19. DECIMAL when the scale is greater than 0 or the precision is greater than 19. |
| CLOB, NCLOB | STRING |
| BINARY_DOUBLE | DOUBLE |
| BINARY_FLOAT | FLOAT |
| BLOB | BINARY |
| CHAR, NCHAR | CHAR |
| VARCHAR2, NVARCHAR2 | VARCHAR |
| ROWID, UROWID | BINARY |
| DATE | TIMESTAMP |
| TIMESTAMP | TIMESTAMP |
| TIMESTAMP WITH TIME ZONE, TIMESTAMP WITH LOCAL TIME ZONE | Unsupported |
| RAW | BINARY |
The properties of external tables can be described as follows:
6.2.3.1 TBLPROPERTIES
| Property | Use |
|---|---|
| oracle.hcat.osh.columns.mapping | Comma-separated list that specifies the mapping between Hive columns and Oracle table columns. All external tables using OracleStorageHandler must define this. |
| mapreduce.jdbc.url | Connection URL to connect to the database. |
| mapreduce.jdbc.username | Connection user name to connect to the database. |
| mapreduce.jdbc.password | Connection password to connect to the database. |
| mapreduce.jdbc.input.table.name | Oracle table name. |
| mapreduce.jdbc.input.conditions | Conditions to be used when querying the database. Must be used for query pushdown. |
| mapreduce.jdbc.input.query | Query to be used when querying the database. Use a query only when a subset of the columns is selected. |
| mapreduce.jdbc.input.orderby | ORDER BY clause to be used when querying the database. |
| oracle.hcat.osh.splitterKind | Specifies how OracleStorageHandler must create splits, so that they are a good match for the physical structure of the target table in Oracle Database. The applicable splitter kinds are SINGLE_SPLITTER, ROW_SPLITTER, BLOCK_SPLITTER, PARTITION_SPLITTER, and CUSTOM_SPLITTER. |
| oracle.hcat.osh.rowsPerSplit | Used only when ROW_SPLITTER is specified; sets the number of rows per split. The default is 1000. |
| oracle.hcat.osh.authentication | Authentication method used to connect to Oracle Database. Can be KERBEROS, ORACLE_WALLET, or basic (user name and password) authentication. |
| sun.security.krb5.principal | Kerberos principal. Used only when Kerberos authentication is applied. |
| oracle.hcat.osh.kerb.callback | Callback for Kerberos authentication. Used only when Kerberos authentication is applied. |
| oracle.hcat.osh.maxSplits | Maximum number of splits for any splitter kind. |
| oracle.hcat.osh.useChunkSplitter | Use chunk-based ROW_SPLITTER and BLOCK_SPLITTER, which use the DBMS_PARALLEL_EXECUTE package to divide the table into chunks that map to Hadoop splits. The default value is 'true'. |
| oracle.hcat.osh.chunkSQL | Used by CUSTOM_SPLITTER to create splits. The SQL string must be a SELECT statement that returns the range of each chunk as two columns, start_id and end_id, both of ROWID type. |
| oracle.hcat.osh.useOracleParallelism | When configured, parallel queries are executed while fetching rows from Oracle. Default value: 'false'. |
| oracle.hcat.osh.fetchSize | JDBC fetch size for the generated SELECT queries used to fetch rows. Default value: 10 (set by the Oracle JDBC driver). |
Note:
In addition to the above, any JDBC connection properties (oracle.jdbc.* and oracle.net.*) can be specified as TBLPROPERTIES. They are used while establishing the connection to Oracle Database using the JDBC driver.
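For example, the following is a minimal sketch based on the EMPLOYEES example used later in this chapter; the Hive table name employees_with_jdbc_props and the two timeout values are illustrative assumptions, not OD4H defaults. The last two properties are standard Oracle JDBC connection properties passed through unchanged.
CREATE EXTERNAL TABLE employees_with_jdbc_props (
EMPLOYEE_ID INT,
FIRST_NAME STRING,
LAST_NAME STRING
)
STORED BY 'oracle.hcat.osh.OracleStorageHandler'
WITH SERDEPROPERTIES (
'oracle.hcat.osh.columns.mapping' = 'employee_id,first_name,last_name')
TBLPROPERTIES (
'mapreduce.jdbc.url' = 'jdbc:oracle:thin:@localhost:1521:orcl',
'mapreduce.jdbc.username' = 'hr',
'mapreduce.jdbc.password' = 'hr',
'mapreduce.jdbc.input.table.name' = 'EMPLOYEES',
'oracle.jdbc.ReadTimeout' = '60000',
'oracle.net.CONNECT_TIMEOUT' = '10000'
);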
Note:
Oracle DataSource for Apache Hadoop (OD4H) works with Oracle views and Oracle tables.
6.2.3.2 SERDE PROPERTIES
| Property | Use |
|---|---|
| oracle.hcat.osh.columns.mapping | All external tables using OracleStorageHandler must define this. It is a comma-separated list that specifies the mapping between Hive columns (specified in CREATE TABLE) and Oracle table columns. |
6.2.4 List of jars in the OD4H package
Oracle DataSource for Apache Hadoop (OD4H) contains the following JAR files.
Table 6-1 List of jars in OD4H
| Name of JAR | Use |
|---|---|
| osh.jar | Contains OracleStorageHandler Implementation |
| ojdbc8.jar | An OD4H specific JDBC driver (which is optimized with internal calls), used by Spark or Hadoop tasks to connect to the database. |
| ucp.jar | For creating connection pools in OracleStorageHandler |
| oraclepki103.jar, osdt_core.jar, osdt_cert.jar, osdt_jce.jar | For Oracle Wallet authentication |
| orai18n.jar | Oracle Globalization Support |
| xdb.jar | Oracle XDB jar |
6.3 How does OD4H work?
Oracle DataSource for Apache Hadoop (OD4H) does not require creating a new table. You can start working with OD4H using the following steps:
- Create a new Oracle table, or reuse an existing table.
- Create the Hive DDL for creating the external table referencing the Oracle table.
- Issue HiveSQL, SparkSQL, or other Spark/Hadoop queries and API calls.
The following sections show how to create a new Oracle Database Table, and a Hive DDL:
6.3.1 Create a new Oracle Database Table or Reuse an Existing Table
Here is an illustration of a partitioned Oracle table that we will use to demonstrate how partition pruning works:
CREATE TABLE EmployeeData (
  Emp_ID     NUMBER,
  First_Name VARCHAR2(20),
  Last_Name  VARCHAR2(20),
  Job_Title  VARCHAR2(40),
  Salary     NUMBER
)
PARTITION BY RANGE (Salary)
(
  PARTITION salary_1 VALUES LESS THAN (60000)  TABLESPACE tsa,
  PARTITION salary_2 VALUES LESS THAN (70000)  TABLESPACE tsb,
  PARTITION salary_3 VALUES LESS THAN (80000)  TABLESPACE tsc,
  PARTITION salary_4 VALUES LESS THAN (90000)  TABLESPACE tsd,
  PARTITION salary_5 VALUES LESS THAN (100000) TABLESPACE tse
);

Note:
You can use this syntax for table creation in the examples that follow in this chapter.
- Issue queries from Hive, Spark, or any other Hadoop model (including joins with local Hive tables).
6.3.2 Hive DDL
In this example, we will associate two Hive external tables with the same Oracle table, using two different split patterns:
- SINGLE_SPLITTER
- PARTITION_SPLITTER

Note:
The external table can have fewer columns than the base Oracle table. Since columns can have different names, use TBLPROPERTY for mapping with the base table. A sketch of such a subset mapping follows this note.
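The following is a minimal sketch of a subset mapping; the Hive table name EmployeeNamesOnly is hypothetical, it uses the connection variables introduced just below, and it exposes only two columns of the EmployeeData table defined in the previous section.
CREATE EXTERNAL TABLE EmployeeNamesOnly (
Emp_ID int,
Last_Name string
)
STORED BY 'oracle.hcat.osh.OracleStorageHandler'
WITH SERDEPROPERTIES (
'oracle.hcat.osh.columns.mapping' = 'Emp_ID,Last_Name')
TBLPROPERTIES (
'mapreduce.jdbc.url' = '${hiveconf:connection_string}',
'mapreduce.jdbc.username' = '${hiveconf:oracle_user}',
'mapreduce.jdbc.password' = '${hiveconf:oracle_pwd}',
'mapreduce.jdbc.input.table.name' = 'EmployeeData'
);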
In the following examples, we are using the following variables:
connection_string = jdbc:oracle:thin:@localhost:1521/<servicename>
oracle_user=od4h
oracle_pwd=od4h
The following command creates a Hive external table with the default split pattern, that is, SINGLE_SPLITTER.
CREATE EXTERNAL TABLE EmployeeDataSimple (
Emp_ID int,
First_Name string,
Last_Name string,
Job_Title string,
Salary int
)
STORED BY 'oracle.hcat.osh.OracleStorageHandler'
WITH SERDEPROPERTIES (
'oracle.hcat.osh.columns.mapping' = 'Emp_ID,First_Name,Last_Name,Job_Title,Salary')
TBLPROPERTIES (
'mapreduce.jdbc.url' = '${hiveconf:connection_string}',
'mapreduce.jdbc.username' = '${hiveconf:oracle_user}',
'mapreduce.jdbc.password' = '${hiveconf:oracle_pwd}',
'mapreduce.jdbc.input.table.name' = 'EmployeeData'
);

The following example creates a Hive external table using PARTITION_SPLITTER.
DROP TABLE EmployeeDataPartitioned;
CREATE EXTERNAL TABLE EmployeeDataPartitioned (
Emp_ID int,
First_Name string,
Last_Name string,
Job_Title string,
Salary int
)
STORED BY 'oracle.hcat.osh.OracleStorageHandler'
WITH SERDEPROPERTIES (
'oracle.hcat.osh.columns.mapping' = 'Emp_ID,First_Name,Last_Name,Job_Title,Salary')
TBLPROPERTIES (
'mapreduce.jdbc.url' = '${hiveconf:connection_string}',
'mapreduce.jdbc.username' = '${hiveconf:oracle_user}',
'mapreduce.jdbc.password' = '${hiveconf:oracle_pwd}',
'mapreduce.jdbc.input.table.name' = 'EmployeeData',
'oracle.hcat.osh.splitterKind' = 'PARTITIONED_TABLE'
);

6.3.3 Create External Tables in Hive
You can create an external table in Hive in the following way:
DROP TABLE employees;
CREATE EXTERNAL TABLE employees (
EMPLOYEE_ID INT,
FIRST_NAME STRING,
LAST_NAME STRING,
SALARY DOUBLE,
HIRE_DATE TIMESTAMP,
JOB_ID STRING
)
STORED BY 'oracle.hcat.osh.OracleStorageHandler'
WITH SERDEPROPERTIES (
'oracle.hcat.osh.columns.mapping' = 'employee_id,first_name,last_name,salary,hire_date,job_id')
TBLPROPERTIES (
'mapreduce.jdbc.url' = 'jdbc:oracle:thin:@localhost:1521:orcl',
'mapreduce.jdbc.username' = 'hr',
'mapreduce.jdbc.password' = 'hr',
'mapreduce.jdbc.input.table.name' = 'EMPLOYEES'
);

Note:
To use OD4H, ensure that the ucp.jar, ojdbc8.jar, and osh.jar files are present in the Hive CLASSPATH environment variable. This is preconfigured on Oracle Big Data Appliance.
To learn more about the CLASSPATH environment variable and other Hive configuration properties, refer to the following sources:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Cli
https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties
For Cloudera distribution, refer to the following page that explains the steps for Cloudera Manager:
https://www.cloudera.com/documentation/enterprise/5-14-x/topics/cm_mc_hive_udf.html
For other distributions, refer to the respective documentation on adding additional JAR files to Hive or HiveServer2 environment.
6.4 Features of OD4H
The following topics discuss features of OD4H.
6.4.1 Performance And Scalability Features
Following sections discuss the performance and scalability features of OD4H:
HCatalog stores the table metadata from the Hive DDL. Hive SQL, Spark SQL, and other engines then use this metadata when submitting queries.
The Oracle table is divided into granules determined by the splitterKind property. These granules are then read into a split by OracleStorageHandler, by submitting generated queries.
If the query plan determines which splits need to be scanned, OracleStorageHandler does not have to test all possible query types.
6.4.1.1 Splitters
While executing a query on a Hive external table through OD4H, the underlying Oracle table is dynamically divided into granules, which correspond to splits on the Hadoop side. Each split is processed by a single map task. With the oracle.hcat.osh.splitterKind property, you can specify how the splits are created. This ensures that the splits are a good match for the physical structure of the target table in Oracle Database.
The different kinds of splitters available are:
SINGLE_SPLITTER
Creates one split for the table. Use SINGLE_SPLITTER where a single task is sufficient to process the query against the entire table.
ROW_SPLITTER
Limits the number of rows per split. The default number of rows is 1000. You can specify the number of rows by setting the oracle.hcat.osh.rowsPerSplit property. The default value of oracle.hcat.osh.maxSplits is 1 when ROW_SPLITTER is used. You can increase this value to enable parallel reads.
Based on the rowsPerSplit property, OD4H divides the table into splits. If the number of splits obtained is higher than maxSplits, then the maxSplits property is honored and the rows per split are adjusted accordingly.
Note:
oracle.hcat.osh.rowsPerSplit is used only by ROW_SPLITTER and not by any other splitter kind.
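For example (illustrative numbers): with oracle.hcat.osh.rowsPerSplit set to 1000 on a table of 1,000,000 rows, OD4H would derive 1000 candidate splits; if oracle.hcat.osh.maxSplits is 100, only 100 splits are created and each split then covers roughly 10,000 rows.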
BLOCK_SPLITTER
Creates splits based on the underlying storage data blocks of the table. The default value of oracle.hcat.osh.maxSplits is 1 when BLOCK_SPLITTER is used. You can increase this value to enable parallel reads. BLOCK_SPLITTER requires the SELECT privilege on the SYS.DBA_EXTENTS view, granted to the schema containing the Oracle target table. If this privilege does not exist, OD4H uses SINGLE_SPLITTER.

Note:
The actual number of splits under BLOCK_SPLITTER may be less than the value specified in the oracle.hcat.osh.maxSplits property.
Do not use BLOCK_SPLITTER on partitioned tables or index-organized tables.
Note:
For the ROW_SPLITTER and BLOCK_SPLITTER kinds, use oracle.hcat.osh.useChunkSplitter to specify the splitting mechanism. The default property value is true. This enables creating chunks corresponding to splits using the DBMS_PARALLEL_EXECUTE package. When the property value is false, custom SQL is generated for splitting. Because DBMS_PARALLEL_EXECUTE can only be used for tables and not views, if mapreduce.jdbc.input.table.name points to a view and not a table, then oracle.hcat.osh.useChunkSplitter should be set to false (see the sketch after this note).
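The following is a minimal sketch of that case, assuming a hypothetical Oracle view named EMPLOYEE_DATA_VW over the EmployeeData table; the choice of ROW_SPLITTER here is illustrative, and only the last property differs in purpose from the earlier examples.
CREATE EXTERNAL TABLE EmployeeDataFromView (
Emp_ID int,
First_Name string,
Last_Name string,
Job_Title string,
Salary int
)
STORED BY 'oracle.hcat.osh.OracleStorageHandler'
WITH SERDEPROPERTIES (
'oracle.hcat.osh.columns.mapping' = 'Emp_ID,First_Name,Last_Name,Job_Title,Salary')
TBLPROPERTIES (
'mapreduce.jdbc.url' = '${hiveconf:connection_string}',
'mapreduce.jdbc.username' = '${hiveconf:oracle_user}',
'mapreduce.jdbc.password' = '${hiveconf:oracle_pwd}',
'mapreduce.jdbc.input.table.name' = 'EMPLOYEE_DATA_VW',
'oracle.hcat.osh.splitterKind' = 'ROW_SPLITTER',
'oracle.hcat.osh.useChunkSplitter' = 'false'
);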
PARTITION_SPLITTER
Creates one split per partition. PARTITION_SPLITTER is used by default when the table is partitioned. You can override this setting by specifying ROW_SPLITTER in table properties. With PARTITION_SPLITTER, the default value of oracle.hcat.osh.maxSplits table property is 64.
Following is an illustration of ROW_SPLITTER:
DROP TABLE employees;
CREATE EXTERNAL TABLE employees (
EMPLOYEE_ID INT,
FIRST_NAME STRING,
LAST_NAME STRING,
SALARY DOUBLE,
HIRE_DATE TIMESTAMP,
JOB_ID STRING
)
STORED BY 'oracle.hcat.osh.OracleStorageHandler'
WITH SERDEPROPERTIES (
'oracle.hcat.osh.columns.mapping' = 'employee_id,first_name,last_name,salary,hire_date,job_id')
TBLPROPERTIES (
'mapreduce.jdbc.url' = 'jdbc:oracle:thin:@localhost:1521:orcl',
'mapreduce.jdbc.username' = 'hr',
'mapreduce.jdbc.password' = 'hr',
'mapreduce.jdbc.input.table.name' = 'EMPLOYEES',
'oracle.hcat.osh.splitterKind' = 'ROW_SPLITTER',
'oracle.hcat.osh.rowsPerSplit' = '1500'
);

CUSTOM_SPLITTER
Use CUSTOM_SPLITTER if you want to provide a custom split generation mechanism. You can do this by specifying CUSTOM_SPLITTER through the oracle.hcat.osh.splitterKind property, together with a SELECT statement in oracle.hcat.osh.chunkSQL that emits the ROWIDs corresponding to the start and end of each split. A sketch follows.
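A minimal sketch, reusing the EmployeeData table and connection variables from the earlier examples; the external table name EmployeeDataCustom is hypothetical, and the NTILE-based chunk query is only one illustrative way to produce non-overlapping ROWID ranges named start_id and end_id (here, four chunks).
CREATE EXTERNAL TABLE EmployeeDataCustom (
Emp_ID int,
First_Name string,
Last_Name string,
Job_Title string,
Salary int
)
STORED BY 'oracle.hcat.osh.OracleStorageHandler'
WITH SERDEPROPERTIES (
'oracle.hcat.osh.columns.mapping' = 'Emp_ID,First_Name,Last_Name,Job_Title,Salary')
TBLPROPERTIES (
'mapreduce.jdbc.url' = '${hiveconf:connection_string}',
'mapreduce.jdbc.username' = '${hiveconf:oracle_user}',
'mapreduce.jdbc.password' = '${hiveconf:oracle_pwd}',
'mapreduce.jdbc.input.table.name' = 'EmployeeData',
'oracle.hcat.osh.splitterKind' = 'CUSTOM_SPLITTER',
'oracle.hcat.osh.chunkSQL' = 'SELECT MIN(rid) start_id, MAX(rid) end_id FROM (SELECT rowid rid, NTILE(4) OVER (ORDER BY rowid) bucket FROM EmployeeData) GROUP BY bucket'
);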
6.4.1.2 Choosing a Splitter
SINGLE_SPLITTER is used by default if no splitter is specified in the table properties for the Hive external table and the target Oracle table is not partitioned.
For an unpartitioned table, the default value of oracle.hcat.osh.maxSplits is 1. For a partitioned table, the default value is 64 and the default splitter is PARTITION_SPLITTER. The default for maxSplits is set to limit the number of connections to the Oracle server. To increase this limit, you must increase the value of oracle.hcat.osh.maxSplits explicitly in the Hive table properties.
Use the following guidelines while choosing a splitter kind for a Hive external table:
| Splitter Kind | Use |
|---|---|
| SINGLE_SPLITTER | When no parallelism is required. |
| PARTITION_SPLITTER | Used by default when the target table is partitioned. |
| BLOCK_SPLITTER | When the Oracle user has the SELECT privilege on SYS.DBA_EXTENTS and the target table is not partitioned. |
| ROW_SPLITTER | When the Oracle user does not have the SELECT privilege on SYS.DBA_EXTENTS. |
| CUSTOM_SPLITTER | For fine-grained control over generated splits. |
6.4.1.3 Predicate Pushdown
Predicate Pushdown is an optimization technique, in which you push predicates (WHERE condition) down to be evaluated by Oracle Database at the time of querying. This minimizes the amount of data fetched from Oracle Database to Hive, while performing a query.
Set the configuration property hive.optimize.ppd to true to enable Predicate Pushdown, or to false to disable it. The default value in Hive 1.1.0 is true, so Predicate Pushdown is always performed unless you explicitly disable it.
Note:
OD4H does not push down all possible predicates. It considers only the part of the execution plan pertaining to the Oracle table declared as an external table. OD4H also rewrites sub-queries for the Oracle SQL engine and each split task. At present, conditions involving the operators >, =, <, and != in a single condition over a column (for example, key > 10), or a combination of multiple conditions separated by AND (for example, key > 10 AND key < 20 AND key != 17), are pushed down.
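For example, the following is a minimal sketch against the EmployeeDataSimple table defined earlier; with hive.optimize.ppd enabled, both Salary conditions are pushed down and evaluated by Oracle Database rather than by Hive:
SET hive.optimize.ppd=true;
SELECT Emp_ID, Last_Name, Salary
FROM EmployeeDataSimple
WHERE Salary > 70000 AND Salary < 90000;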
Another option to reduce the amount of data fetched from the Oracle Database is to specify a condition at the time of table creation, using TBLPROPERTY mapreduce.jdbc.input.conditions. For instance:
mapreduce.jdbc.input.conditions = 'key > 10 OR key = 0'.
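A minimal sketch of that table-level condition, reusing the EmployeeData table and the connection variables from the earlier examples; the Hive table name EmployeeDataFiltered and the Salary condition are illustrative only.
CREATE EXTERNAL TABLE EmployeeDataFiltered (
Emp_ID int,
First_Name string,
Last_Name string,
Job_Title string,
Salary int
)
STORED BY 'oracle.hcat.osh.OracleStorageHandler'
WITH SERDEPROPERTIES (
'oracle.hcat.osh.columns.mapping' = 'Emp_ID,First_Name,Last_Name,Job_Title,Salary')
TBLPROPERTIES (
'mapreduce.jdbc.url' = '${hiveconf:connection_string}',
'mapreduce.jdbc.username' = '${hiveconf:oracle_user}',
'mapreduce.jdbc.password' = '${hiveconf:oracle_pwd}',
'mapreduce.jdbc.input.table.name' = 'EmployeeData',
'mapreduce.jdbc.input.conditions' = 'Salary > 70000'
);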
Note:
Due to incompatibilities between the date and timestamp representations in Hive and Oracle, these columns are not pushed down by default in a query. You can enable pushdown, with certain limitations, by setting the table property oracle.hcat.datetime.pushdown to true. When set to true, the date representation in the query must be in the form YYYY-MM-DD and the timestamp representation must be in the form "YYYY-MM-DD HH:MM:SS" with no decimal places. No other date or timestamp representation is supported when oracle.hcat.datetime.pushdown is set to true.
6.4.1.4 Projection Pushdown
Projection Pushdown is an optimization technique that fetches only the required columns from Oracle Database when a query is performed. If you want to fetch all columns during a query (not recommended), you can disable it by setting the hive.io.file.read.all.columns connection property to true. In Hive 1.1.0, this property is false by default.
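For example, the following sketch against the EmployeeDataSimple table defined earlier fetches only the two projected columns from Oracle Database; the explicit SET simply makes the default behavior visible:
SET hive.io.file.read.all.columns=false;
SELECT First_Name, Salary
FROM EmployeeDataSimple;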
6.4.1.5 Partition Pruning
If you refer to the EmployeeDataPartitioned table, the partitions irrelevant to the query are removed from the partition list. This is done by executing an explain plan on the query to obtain the list of partitions and sub-partitions that are relevant to the query.
Table-level partition pruning uses table-level predicate pushdown, whereas partition pruning at the query level uses query-level predicate pushdown.
Partition pruning is active when a SELECT query is run in which the WHERE clause uses the partitioning key. Following is an example of partition pruning: to query only the partition where salary is between 72000 and 78000, and prune the other partitions, create the following Hive external table and run the SELECT statement shown after it.
Hive External Table:
CREATE EXTERNAL TABLE EmployeeDataPartitioned (
Emp_ID int,
First_Name string,
Last_Name string,
Job_Title string,
Salary int
)
STORED BY 'oracle.hcat.osh.OracleStorageHandler'
WITH SERDEPROPERTIES (
'oracle.hcat.osh.columns.mapping' = 'Emp_ID,First_Name,Last_Name,Job_Title,Salary')
TBLPROPERTIES (
'mapreduce.jdbc.url' = '${hiveconf:connection_string}',
'mapreduce.jdbc.username' = '${hiveconf:oracle_user}',
'mapreduce.jdbc.password' = '${hiveconf:oracle_pwd}',
'mapreduce.jdbc.input.table.name' = 'EmployeeData',
'oracle.hcat.osh.splitterKind' = 'PARTITIONED_TABLE'
);
The following SELECT statement shows how to query the partition where salary is between 72000 and 78000, and prunes the other partitions:
select * from EmployeeDataPartitioned where salary > 72000 and salary < 78000;
6.4.2 Smart Connection Management
Connection Caching
Each map task runs in its own JVM. Each JVM in turn caches a single connection to the Oracle database, which can be reused within the same query. The mapper checks the cache before establishing a new connection, and the connection is not cached after the query completes.
Oracle RAC Awareness
JDBC and UCP are aware of the various Oracle RAC instances. This awareness can be used to split queries submitted to JDBC. The StorageHandler relies on the listener for load balancing.
Handling Logon Storms
Hadoop allows you to limit the number of mappers attempting to connect to the database by using oracle.hcat.osh.maxSplits. This parameter controls the degree of concurrency. However, subsequent tasks of the same query are guaranteed to query their table granule as per the System Change Number (SCN) of the query. This ensures consistency of the result sets.
Database Resident Connection Pooling (DRCP)
It is recommended to configure DRCP for OD4H, and limit the maximum number of concurrent connections to the Oracle Database from OD4H.
To configure DRCP, use the following steps:
- Log in as SYSDBA.
- Start the default pool, SYS_DEFAULT_CONNECTION_POOL, using DBMS_CONNECTION_POOL.START_POOL with the default settings.
  You can use DBMS_CONNECTION_POOL.MINSIZE and DBMS_CONNECTION_POOL.MAXSIZE with the default settings.

A sketch of these steps follows.
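A minimal sketch, run from SQL*Plus as SYSDBA; the MAXSIZE value shown is illustrative only, and by default you can simply start the pool and keep its default settings.
-- Start the default connection pool with its default settings.
EXEC DBMS_CONNECTION_POOL.START_POOL();

-- Optional, illustrative: cap the pool size to limit concurrent OD4H connections.
EXEC DBMS_CONNECTION_POOL.ALTER_PARAM('SYS_DEFAULT_CONNECTION_POOL', 'MAXSIZE', '40');

Clients then request pooled sessions by including (SERVER=POOLED) in the CONNECT_DATA section of the JDBC connect descriptor.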
See Also:
For more information on configuring DRCP, see the Oracle Database Administration Guide.

6.4.3 Security Features
Following are the security features of OD4H:
6.4.3.1 Improved Authentication
OD4H uses the Oracle JDBC driver for connecting to Oracle Database. It provides all the authentication methods supported by Oracle JDBC. OD4H supports authentication through basic authentication (user name and password), Oracle Wallet, and Kerberos. You can specify the authentication to be used for a table created in Hive through the oracle.hcat.osh.authentication table property. This is useful only for strong authentication.
OD4H supports the following authentication methods:
- Kerberos
- Oracle Wallet
- Basic Authentication
Note:
Oracle recommends using strong authentication, such as Kerberos.

The various authentication processes are described with examples as follows:
- Kerberos
Uses the Kerberos credentials of the Hadoop engine process. This principal should have access to the table.

See Also:
Oracle Database JDBC Developer's Guide for information on configuring the database for Kerberos and details of the client parameters.

You can enable the Kerberos configuration on Hive by adding the following to hive-env.sh:

export HADOOP_OPTS="$HADOOP_OPTS -Djava.security.krb5.conf=<path to kerberos configuration>"

To enable child JVMs to use the Kerberos configuration, edit mapred-site.xml to include the following property on all nodes of the cluster:

<property>
  <name>mapred.child.java.opts</name>
  <value>-Djava.security.krb5.conf=<path to kerberos configuration></value>
</property>

Enable these configurations on Oracle Big Data Appliance using Cloudera Manager.
Following is an illustration of Kerberos authentication:
CREATE EXTERNAL TABLE kerb_example (
id DECIMAL,
name STRING,
salary DECIMAL
)
STORED BY 'oracle.hcat.osh.OracleStorageHandler'
WITH SERDEPROPERTIES (
'oracle.hcat.osh.columns.mapping' = 'id,name,salary')
TBLPROPERTIES (
'mapreduce.jdbc.url' = 'jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=adc*******.xxxxxx.com)(PORT=5521))(CONNECT_DATA=(SERVICE_NAME=project_name.xxx.rdbms.xxxx.com)))',
'mapreduce.jdbc.input.table.name' = 'kerb_example',
'mapreduce.jdbc.username' = 'CLIENT@xxxxxx.COM',
'oracle.hcat.osh.authentication' = 'KERBEROS',
'oracle.net.kerberos5_cc_name' = '/tmp/krb5cc_xxxxx',
'java.security.krb5.conf' = '/home/user/kerberos/krb5.conf',
'oracle.hcat.osh.kerb.callback' = 'KrbCallbackHandler',
'sun.security.krb5.principal' = 'CLIENT@xxxxx.COM'
);

The path specified in java.security.krb5.conf should be accessible to all nodes of the cluster. These paths should also match the paths of the corresponding properties in the Oracle Database sqlnet.ora. The keytab path provided in sqlnet.ora should also be accessible from all nodes of the cluster.

If sun.security.krb5.principal is not specified, OD4H attempts to authenticate using the default principal in the credential cache specified by the oracle.net.kerberos5_cc_name property.

Note:
The callback is called only if the principal cannot be authenticated using a ticket obtained from the credential cache specified in the oracle.net.kerberos5_cc_name property.

A simple callback handler class is described as follows (the callback class must be available in the Hive classpath):
import java.io.IOException;
import javax.security.auth.callback.Callback;
import javax.security.auth.callback.CallbackHandler;
import javax.security.auth.callback.NameCallback;
import javax.security.auth.callback.PasswordCallback;
import javax.security.auth.callback.UnsupportedCallbackException;

class KrbCallbackHandler implements CallbackHandler {

    @Override
    public void handle(Callback[] callbacks) throws IOException, UnsupportedCallbackException {
        for (int i = 0; i < callbacks.length; i++) {
            if (callbacks[i] instanceof PasswordCallback) {
                // Supply the password for the Kerberos principal.
                PasswordCallback pc = (PasswordCallback) callbacks[i];
                System.out.println("set password to 'welcome'");
                pc.setPassword((new String("welcome")).toCharArray());
            } else if (callbacks[i] instanceof NameCallback) {
                // Supply the principal name.
                ((NameCallback) callbacks[i]).setName("client@xxxxx.COM");
            } else {
                throw new UnsupportedCallbackException(callbacks[i], "Unrecognized Callback");
            }
        }
    }
}

- Oracle Wallet
The wallet should be available in the OS environment of each engine process. Following is an illustration of how to add Wallet authentication:
CREATE EXTERNAL TABLE wallet_example (
id DECIMAL,
name STRING,
salary DECIMAL
)
STORED BY 'oracle.hcat.osh.OracleStorageHandler'
WITH SERDEPROPERTIES (
'oracle.hcat.osh.columns.mapping' = 'id,name,salary')
TBLPROPERTIES (
'mapreduce.jdbc.url' = 'jdbc:oracle:thin:/@inst1',
'mapreduce.jdbc.input.table.name' = 'wallet_example',
'oracle.hcat.osh.authentication' = 'ORACLE_WALLET',
'oracle.net.tns_admin' = '/scratch/user/view_storage/user_project6/work',
'oracle.net.wallet_location' = '/scratch/user/view_storage/user_project6/work'
);

Note:
The paths specified in oracle.net.tns_admin and oracle.net.wallet_location should be accessible from all nodes of the cluster.

- Basic Authentication (for demonstration purposes only)
This is stored in HCatalog TBLPROPERTIES or supplied in the HiveQL SELECT statement.
When Basic Authentication is used, the user name and password for the Oracle schema are specified in the Hive external table properties.
Note:
Oracle does not recommend this authentication process in the production environment because the password is stored unmasked in HCatalog.
6.5 Use HiveQL with OD4H
HiveQL is a SQL-like language provided by Hive. It can be used to query Hive external tables created using OD4H.
To track the status of a running query on Oracle Big Data Appliance, you can open the Resource Manager web interface in your browser at the following URL:
http://<domain>:8088/<cluster_name>
You can also see the logs of a query in Cloudera Manager, which also indicates the actual query sent to Oracle Database corresponding to your query on HiveQL. Hive and OD4H use the Simple Logging Facade for Java (SLF4J) framework for logging. You can use the logging configuration techniques of Hive for controlling the logging level of classes related to OD4H.
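For example, the following HiveQL query (a sketch, assuming the EmployeeDataSimple table defined earlier) can be run from the Hive CLI or Beeline like any other Hive query:
SELECT Last_Name, Job_Title, Salary
FROM EmployeeDataSimple
WHERE Salary > 70000
ORDER BY Salary DESC
LIMIT 10;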
6.6 Use Spark SQL with OD4H
Spark SQL enables you to execute relational queries expressed in SQL and HiveSQL using Spark. Spark SQL mixes SQL queries with the programmatic data manipulations supported by RDDs (Resilient Distributed Datasets) in Java, Python, and Scala, within a single application. You can also use it to query external tables created using OD4H.
Before running the queries, perform the following steps to configure Spark SQL on Oracle Big Data Appliance:
- Add ojdbc7.jar and osh.jar to the CLASSPATH in /usr/lib/spark/bin/compute-classpath.sh:

CLASSPATH="$CLASSPATH:/opt/oracle/od4h/lib/osh.jar"
CLASSPATH="$CLASSPATH:/opt/oracle/od4h/lib/ojdbc7.jar"

- Edit SPARK_HOME in /usr/lib/spark/conf/spark-env.sh:

export SPARK_HOME=/usr/lib/spark:/etc/hive/conf

- Specify the additional environment variables in /usr/lib/spark/conf/spark-env.sh. The Hive-related variables that you need to set for running the code example are shown below; the file already contains the Hadoop-related environment variables.

export DEFAULT_HADOOP=/usr/lib/hadoop
export DEFAULT_HIVE=/usr/lib/hive
export DEFAULT_HADOOP_CONF=/etc/hadoop/conf
export DEFAULT_HIVE_CONF=/etc/hive/conf
export HADOOP_HOME=${HADOOP_HOME:-$DEFAULT_HADOOP}
export HADOOP_HDFS_HOME=${HADOOP_HDFS_HOME:-${HADOOP_HOME}/../hadoop-hdfs}
export HADOOP_MAPRED_HOME=${HADOOP_MAPRED_HOME:-${HADOOP_HOME}/../hadoop-mapreduce}
export HADOOP_YARN_HOME=${HADOOP_YARN_HOME:-${HADOOP_HOME}/../hadoop-yarn}
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-$DEFAULT_HADOOP_CONF}
export HIVE_CONF_DIR=${HIVE_CONF_DIR:-$DEFAULT_HIVE_CONF}

CLASSPATH="$CLASSPATH:$HIVE_CONF_DIR"
CLASSPATH="$CLASSPATH:$HADOOP_CONF_DIR"

if [ "x" != "x$YARN_CONF_DIR" ]; then
  CLASSPATH="$CLASSPATH:$YARN_CONF_DIR"
fi

# Let's make sure that all needed hadoop libs are added properly
CLASSPATH="$CLASSPATH:$HADOOP_HOME/client/*"
CLASSPATH="$CLASSPATH:$HIVE_HOME/lib/*"
CLASSPATH="$CLASSPATH:$($HADOOP_HOME/bin/hadoop classpath)"
Once configured, you can run some sample queries on Spark SQL using the scripts included in the /shell/*QuerySpark.sh demo files. By default, Spark prints the queries on the console. To modify this behavior, you can edit the Spark logging configuration file /usr/lib/spark/conf/log4j.properties.
The log printed by OracleRecordReader shows the actual query sent to Oracle Database, as follows:
15/03/18 10:36:08 INFO OracleRecordReader: Reading records from Oracle Table using Query: SELECT FIRST_NAME, LAST_NAME, EMP_ID FROM EmployeeData
6.7 Writing Back to Oracle Database
In the typical use case for OD4H, you store the result sets of Hive or Spark SQL queries back to Oracle Database. OD4H implements OutputFormat to enable you to write back to an Oracle Database table from Hadoop.
After the data is inserted into an Oracle Database table, you can then use your favorite business intelligence tools for further data mining.
The following query is from the OD4H demo code samples. It demonstrates writing back to an external table called EmployeeBonusReport.
Example 6-1 Writing Hive or Spark Result Sets Back to Oracle Database
INSERT INTO EmployeeBonusReport
SELECT EmployeeDataSimple.First_Name, EmployeeDataSimple.Last_Name,
EmployeeBonus.bonus
FROM EmployeeDataSimple JOIN EmployeeBonus ON
(EmployeeDataSimple.Emp_ID=EmployeeBonus.Emp_ID)
WHERE salary > 70000 and bonus > 7000

