The common data architecture in most companies nowadays generally comprises of the following components:
Oracle Database(s) for operational, transactional, and master data, that is shared business object such as customers, products, employees and so on
Big Data
Hadoop applications such as Master Data Management (MDM), Events processing, and others, need access to data in both Hadoop storages (such as HDFS and NoSQL Database as a landing point for weblogs, and so on) and Oracle Database (as the reliable and auditable source of truth). There are two approaches to process such data that reside in both Hadoop storage and Oracle Database:
ETL Copy using tools such as Oracle's Copy to BDA
Direct Access using Oracle Big Data SQL and Oracle DataSource for Apache Hadoop (OD4H).
In this chapter, we will discuss Oracle DataSource for Apache Hadoop (OD4H).
Oracle DataSource for Apache Hadoop (OD4H) is the storage handler for Oracle Database that uses HCatalog and InputFormat.
This section discusses the following concepts:
Hadoop 2.x architecture decouples compute engines from cluster resources management and storages. It enables:
A variety of SQL query engines. For instance, Hive SQL, Spark SQL, Big Data SQL, and so on.
A variety of programmatic compute engines. For instance, MapReduce, Pig, Storm, Solr, Cascading, and so on.
Elastic allocation of compute resources (CPU, memory) through YARN.
A variety of data stores such as HDFS, NoSQL, as well as remote storages through HCatalog, InputFormat, OutputFormat and StorageHandler interfaces.
Oracle DataSource for Apache Hadoop (OD4H) is the storage handler for Oracle Database that uses HCatalog and InputFormat.
Following is an illustration of Hadoop 2.0 Architecture:
OD4H enables current and ad-hoc querying. This makes querying data faster and more secure. You can query data directly and retrieve only the data that you need, when you need it.
OD4H also provides Oracle’s end-to-end security. This includes Identity Management, Column Masking, and Label and Row Security.
OD4H also furnishes direct access for Hadoop and Spark APIs such as Pig, MapReduce and others.
External Tables turn Oracle tables into Hadoop and/or Spark datasources. The DDL for declaring External Tables is as follows:
CREATE[TEMPORARY] EXTERNAL TABLE [IF NOT EXISTS] [db_name.]table_name [(col_name data_type [COMMENTcol_comment],...)] [COMMENT table_comment] STORED BY 'oracle.hcat.osh.OracleStorageHandler' [WITHSERDEPROPERTIES(...)] [TBLPROPERTIES (property_name=property_value,...)] data_type |SMALLINT |INT |BIGINT |BOOLEAN |FLOAT |DOUBLE |STRING |BINARY |TIMESTAMP |DECIMAL |DECIMAL(precision,scale) |VARCHAR |CHAR
See Also:
Refer the following link for Hive External Table syntax https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTableNote:
Oracle supports only primitive types.
Oracle Data Type | Hive Data Type |
|
INT when the scale is 0 and the precision is less than 10. BIGNIT when the scale is 0 and precision is less than 19. DECIMAL when the scale is greater than 0 or the precision is greater than 19. |
|
STRING |
|
DOUBLE |
|
FLOAT |
|
BINARY |
|
CHAR |
|
VARCHAR |
|
BINARY |
DATE |
TIMESTAMP |
TIMESTAMP |
TIMESTAMP |
|
Unsupported |
RAW |
BINARY |
The properties of external tables can be described as follows:
Property | Use |
---|---|
oracle.hcat.osh.columns.mapping |
Comma separated list to specify mapping between Hive columns and Oracle table columns. All external tables using OracleStorageHandler must define this. |
mapreduce.jdbc.url |
Connection URL to connect to the database |
mapreduce.jdbc.username |
Connection user name to connect to the database |
mapreduce.jdbc.password |
Connection password to connect to the database |
mapreduce.jdbc.input.table.name |
Oracle table name |
mapreduce.jdbc.input conditions |
To be used for querying the database. Must be used for query pushdown. |
mapreduce.jdbc.input.query |
To be used for querying the database. Query should be used only when a subset of the columns is selected. |
mapreduce.jdbc.input.orderby |
|
oracle.hcat.osh.splitterKind |
To be used to specify how OracleStorageHandler must create splits, so that they are a good match for the physical structure of the target table in Oracle Database. The splitter kind applicable could be |
oracle.hcat.osh.rowsPerSplit |
Used only when |
oracle.hcat.osh.authentication |
Authentication method used to connect to Oracle Database. Can be |
sun.security.krb5.principal |
Kerberos principal. Used only when |
oracle.hcat.osh.kerb.callback |
Callback for Kerberos authentication. Used only when Kerberos authentication is applied. |
oracle.hcat.osh.maxSplits |
Maximum number of splits for any splitter kind |
oracle.hcat.osh.useChunkSplitter |
Use chunk based ROW_SPLITTER and BLOCK_SPLITTER that use DBMS_PARALLEL_EXECUTE package to divide table into chunks that will map to hadoop splits.The default value is set to ‘true’. |
oracle.hcat.osh.chunkSQL |
Used by CUSTOM_SPLITTER to create splits. The SQL string should be a SELECT statement that returns range of each chunk must have two columns: start_id and end_id The columns must be of ROWID type. |
oracle.hcat.osh.useOracleParallelism |
When configured, parallel queries will be executed while fetching rows from Oracle. Default value: ‘false ’ |
oracle.hcat.osh.fetchSize |
JDBC fetchsize for generated select queries used to fetch rows. Default value: 10 (set by Oracle JDBC Driver) |
Note:
In addition to the above, any JDBC connection properties (oracle.jdbc.*
and oracle.net.*
) can be specified as TBLPROPERTIES
. They will be used while establishing connection to Oracle Database using JDBC driver.
Note:
Oracle DataSource for Apache Hadoop (OD4H) works with Oracle View and Oracle Tables.
Property | Use |
---|---|
oracle.hcat.osh.columns.mapping |
All external tables using OracleStorageHandler must define this. Its a comma separated list to specify mapping between hive columns (specified in create table) and oracle table columns. |
Oracle DataSource for Apache Hadoop (OD4H) contains the following list of jars.
OD4H consists of the following list of jars.
Table 9-1 List of jars in OD4H
Name of JAR | Use |
---|---|
osh.jar | Contains OracleStorageHandler Implementation |
ojdbc7.jar | An OD4H specific JDBC driver (which is optimized with internal calls), used by Spark or Hadoop tasks to connect to the database. |
ucp.jar | For creating connection pools in OracleStorageHandler |
oraclepki103.jar, osdt_core.jar, osdt_cert.jar, osdt_jce.jar | For Oracle Wallet authentication |
orai18n.jar | Oracle Globalization Support |
xdb.jar | Oracle XDB jar |
Oracle DataSource for Apache Hadoop (OD4H) does not require creating a new table. You can start working with OD4H using the following steps:
Create a new Oracle table, or, reuse an existing table.
Create the Hive DDL for creating the external table referencing the Oracle Table.
Issue HiveSQL, SparkSQL, or other Spark/Hadoop queries and API calls.
The following sections show how to create a new Oracle Database Table, and a Hive DDL:
Here is an illustration of a partitioned Oracle table that we will use to demo how partition pruning works:
CREATE TABLE EmployeeData ( Emp_ID NUMBER, First_Name VARCHAR2(20), Last_Name VARCHAR2(20), Job_Title VARCHAR2(40), Salary NUMBER) PARTITION BY RANGE (Salary) ( PARTITION salary_1 VALUES LESS THAN (60000) TABLESPACE tsa , PARTITION salary_2 VALUES LESS THAN (70000) TABLESPACE tsb , PARTITION salary_3 VALUES LESS THAN (80000) TABLESPACE tsc , PARTITION salary_4 VALUES LESS THAN (90000) TABLESPACE tsd , PARTITION salary_5 VALUES LESS THAN (100000) TABLESPACE tse );
Note:
You can use this syntax for table creation, in the following examples listed in this Book.
Issue queries from Hive, Spark, or any other Hadoop models (including joins with local Hive Tables.)
In this example, we will associate two Hive external tables to the same Oracle table, using two different split patterns:
SIMPLE_SPLITTER
PARTITION_SPLITTER
Note:
It is possible that the external table has fewer columns than the base Oracle table.Since columns can have different names, use TBLPROPERTY
for mapping with the base table.
In the following examples, we are using the following variables:
connection_string = jdbc:oracle:thin:@localhost:1521/<servicename>
oracle_user=od4h
oracle_pwd=od4h
The following command creates a Hive external table with the default split pattern, that is SIMPLE_SPLITTER
.
CREATE EXTERNAL TABLE EmployeeDataSimple ( Emp_ID int, First_Name string, Last_Name string, Job_Title string, Salary int ) STORED BY 'oracle.hcat.osh.OracleStorageHandler' WITH SERDEPROPERTIES ( 'oracle.hcat.osh.columns.mapping' = 'Emp_ID,First_Name,Last_Name,Job_Title,Salary') TBLPROPERTIES ( 'mapreduce.jdbc.url' = '${hiveconf:jdbc:oracle:thin:@localhost:1521/<servicename>}', 'mapreduce.jdbc.username' = '${hiveconf:od4h}', 'mapreduce.jdbc.password' = '${hiveconf:od4h}', 'mapreduce.jdbc.input.table.name' = 'EmployeeData' );
The following example creates a Hive external table using PARTITION_SPLITTER
.
DROP TABLE EmployeeDataPartitioned; CREATE EXTERNAL TABLE EmployeeDataPartitioned ( Emp_ID int, First_Name string, Last_Name string, Job_Title string, Salary int ) STORED BY 'oracle.hcat.osh.OracleStorageHandler' WITH SERDEPROPERTIES ( 'oracle.hcat.osh.columns.mapping' = 'Emp_ID,First_Name,Last_Name,Job_Title,Salary') TBLPROPERTIES ( 'mapreduce.jdbc.url' = '${hiveconf:jdbc:oracle:thin:@localhost:1521/<servicename>}', 'mapreduce.jdbc.username' = '${hiveconf:od4h}', 'mapreduce.jdbc.password' = '${hiveconf:od4h}', 'mapreduce.jdbc.input.table.name' = 'EmployeeData', 'oracle.hcat.osh.splitterKind' = 'PARTITIONED_TABLE' );
You can create an external table in Hive in the following way:
DROP TABLE employees; CREATE EXTERNAL TABLE employees ( EMPLOYEE_ID INT, FIRST_NAME STRING, LAST_NAME STRING, SALARY DOUBLE, HIRE_DATE TIMESTAMP, JOB_ID STRING ) STORED BY 'oracle.hcat.osh.OracleStorageHandler' WITH SERDEPROPERTIES ( 'oracle.hcat.osh.columns.mapping' = 'employee_id,first_name,last_name,salary,hire_date,job_id') TBLPROPERTIES ( 'mapreduce.jdbc.url' = 'jdbc:oracle:thin:@localhost:1521:orcl', 'mapreduce.jdbc.username' = 'hr', 'mapreduce.jdbc.password' = 'hr', 'mapreduce.jdbc.input.table.name' = 'EMPLOYEES' );
Note:
Ensure thatucp.jar
, ojdbc8.jar
and osh.jar
are present in the Hive CLASSPATH
, for using OD4H. This is pre-configured on BDA. .
To learn more about CLASSPATH
and other Hive configuration properties, refer the following sources:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Cli
https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties
The following topics discuss features of OD4H.
Following sections discuss the performance and scalability features of OD4H:
HCatalog stores table metadata from Hive DDL. HiveSQl, Spark SQL and others, then use this metadata while submitting queries.
The Oracle table is divided into granules determined by the splitterKind
property. These granules are then read into a split by OracleStorageHandler
, by submitting generated queries.
OracleStorageHandler
will not have to test all possible query types if the query plan determines which splits need to be scanned.
While executing a query on a Hive external table through OD4H, the underlying Oracle table is dynamically divided into granules, which correspond to splits on the Hadoop side. Each split is processed by a single map task. With the help of the ORACLE_SPLITTER_KIND
property, you can specify how the splits are created. This ensures that the splits are a good match for the physical structure of the target table in Oracle Database.
The different kinds of splitters available are:
SINGLE_SPLITTER
Creates one split for the table. Use SINGLE_SPLITTER
where a single task is sufficient to process the query against the entire table.
ROW_SPLITTER
Limits the number of rows per Split. The default number of rows is 1000. You can specify number of rows by setting the oracle.hcat.osh.rowsPerSplit
property. The default value of oracle.hcat.osh.maxSplits
is 1 when ROW_SPLITTER
is used. You can increase this value to enable parallel reads.
rowsPerSplit
property, OD4H will divide tables into splits. If the number of splits obtained is higher than the maxSplits
, then maxSplits
property will be used. The rows per split will be divided accordingly.
Note:
oracle.hcat.osh.rowsPerSplit
is used only by ROW_SPLITTER
and not any other splitter kind.BLOCK_SPLITTER
oracle.hcat.osh.maxSplits
is 1, when BLOCK_SPLITTER
is used. You can increase this value to enable parallel reads. BLOCK_SPLITTER
requires SELECT
privilege on the SYS.DBA.EXTENTS
table, granted to the schema containing the Oracle target table. In the event that this permission does not exist, OD4H will use SINGLE_SPLITTER.
Note:
The actual number of splits underBLOCK_SPLITTER
may be lesser than the value specified in the oracle.hcat.osh.maxSplits
property.
Do not use BLOCK_SPLITTER on partitioned tables or Index Organized tables.
Note:
ForROW_SPLITTER
and BLOCK_SPLITTER
types, use oracle.hcat.osh.useChunkSplitter
to specify splitting mechanism. The default property value is true
. This enables creating chunks corresponding to splits using the DBMS_PARALLEL_EXECUTE
package. When the property value is false
, custom SQL is generated for splitting. Since DBMS_PARALLEL_EXECUTE
can only be used for tables and not views, if mapreduce.jdbc.input.table.name
points to a view and not a table, then oracle.hcat.osh.useChunkSplitter
should be set to false
.PARTITION_SPLITTER
Creates one split per partition. PARTITION_SPLITTER
is used by default when the table is partitioned. You can override this setting by specifying ROW_SPLITTER
in table properties. With PARTITION_SPLITTER
, the default value of oracle.hcat.osh.maxSplits
table property is 64.
Following is an illustration of ROW_SPLITTER
:
DROP TABLE employees; CREATE EXTERNAL TABLE employees ( EMPLOYEE_ID INT, FIRST_NAME STRING, LAST_NAME STRING, SALARY DOUBLE, HIRE_DATE TIMESTAMP, JOB_ID STRING ) STORED BY 'oracle.hcat.osh.OracleStorageHandler' WITH SERDEPROPERTIES ( 'oracle.hcat.osh.columns.mapping' = 'employee_id,first_name,last_name,salary,hire_date,job_id') TBLPROPERTIES ( 'mapreduce.jdbc.url' = 'jdbc:oracle:thin:@localhost:1521:orcl', 'mapreduce.jdbc.username' = 'hr', 'mapreduce.jdbc.password' = 'hr', 'mapreduce.jdbc.input.table.name' = 'EMPLOYEES', 'oracle.hcat.osh.splitterKind' = 'ROW_SPLITTER', 'oracle.hcat.osh.rowsPerSplit' = '1500' );
CUSTOM_SPLITTER
Use CUSTOM_SPLITTER
If you want to provide a custom split generation mechanism. You can do this using CUSTOM_SPLITTER
through oracle.hcat.osh.splitterKind
property and a SELECT
statement that emits ROWIDs corresponding to start and end of each split in oracle.hcat.osh.chunkSQL
.
SINGLE_SPLITTER
is used by default if no splitter is specified in the table properties for Hive external table, and the target Oracle table is not partitioned.
For an unpartitioned table, the default value of oracle.hcat.osh.maxSplits
will be 1. For partitioned table, the default value of the same will be 64, and the default splitter will be PARTITION_SPLITTER
. The default for maxSplits
is set to limit the number of connections to the Oracle server. To increase this limit, you must increase the value of oracle.hcat.osh.maxSplits
explicitly in hive table properties.
Use the following guidelines while choosing a splitter kind for a hive external table:
Splitter Kind | Use |
---|---|
|
When no parallelism is required. |
|
Used by default when target table is partitioned |
|
When Oracle user has |
|
When Oracle user does not have |
|
For fine grain control over generated splits. |
Predicate Pushdown is an optimization technique, in which you push predicates (WHERE
condition) down to be evaluated by Oracle Database at the time of querying. This minimizes the amount of data fetched from Oracle Database to Hive, while performing a query.
Set the configuration property hive.optimize.ppd
to either true
or false
for enabling Predicate Pushdown. The default value on hive-1.1.0 is set to true
. Hence, Predicate Pushdown is always performed, unless you want to disable it.
Note:
OD4H does not push down all possible predicates. It considers only the part of the execution plan pertaining to Oracle table declared as external table. OD4H also rewrites sub-queries for the Oracle SQL engine and each split task. At present conditions involving operators >,=,< in a single condition over a column (e.g. key > 10) or a combination of multiple conditions separated by AND (e.g. key > 10 AND key < 20 AND key !=17) are pushed down.
Another option to reduce the amount of data fetched from the Oracle Database is to specify a condition at the time of table creation, using TBLPROPERTY
mapreduce.jdbc.input.conditions
. For instance:
mapreduce.jdbc.input.conditions = 'key > 10 OR key = 0'.
This will restrict the rows fetched from Oracle Database whenever any query is performed based on the condition specified. The external table that gets created, is analogous to a view on Oracle Database. This approach is only useful when you want to push down complex predicates that cannot be analyzed and automatically pushed down by OD4H.
Table Level Predicate Pushdown
For Table Level Predicate Pushdown to be enabled, you must specify a condition at the time of table creation, using TBLPROPERTY mapreduce.jdbc.input.conditions.
Following is an illustration:
mapreduce.jdbc.input.conditions = 'key > 10 OR key = 0'.
This will restrict the rows fetched from Oracle Database when any query is performed based on the condition specified. The table created will be analogous to a view on Oracle database.
However, Table Level Predicate Pushdown is ignored when a predicate (aWHERE
clause) is specified in the query.
Projection Pushdown is an optimization technique that fetches only the required columns from Oracle Database when a query is performed. If you want to fetch all columns during a query (not recommended), you can disable it by setting the hive.io.file.read.all.columns
connection property to true
. On Hive–1.1.0, this property is false
by default.
If you refer to Employee Data Partition table, the partitions irrelevant to the query are removed from the partition list. This is done by executing an explain plan on the query to obtain the list of partitions and sub-partitions that are relevant to the query.
Table level partition pruning uses table level predicate pushdown, on the other hand partition pruning at the query level uses query level predicate pushdown.
Partition pruning is active when a SELECT
query is run, in which the WHERE
clause uses the partitioning key. Following is an example of partition pruning:
To query the partition, where salary is in the above range and prune other partitions, perform the following:
Hive External Table:
CREATE EXTERNAL TABLE EmployeeDataPartitioned ( Emp_ID int, First_Name string, Last_Name string, Job_Title string, Salary int ) STORED BY 'oracle.hcat.osh.OracleStorageHandler' WITH SERDEPROPERTIES ( 'oracle.hcat.osh.columns.mapping' = 'Emp_ID,First_Name,Last_Name,Job_Title,Salary') TBLPROPERTIES ( 'mapreduce.jdbc.url' = '${hiveconf:connection_string}', 'mapreduce.jdbc.username' = '${hiveconf:oracle_user}', 'mapreduce.jdbc.password' = '${hiveconf:oracle_pwd}', 'mapreduce.jdbc.input.table.name' = 'EmployeeData', 'oracle.hcat.osh.oosKind' = 'PARTITIONED_TABLE' );
The following SELECT
statement shows how to query the partition, where salary is between 72000 to 78000, and prunes other partitions:
select * from EmployeeDataPartitioned where salary > 72000 and salary < 78000;
Connection Caching
Each map task runs in its own JVM. Each JVM in turn caches a single connection to the Oracle database that you can reuse within the same query. The Mapper checks the cache before establishing a new connection and caching is not done once the query has completed executing.
Oracle RAC Awareness
JDBC and UCP are aware of various Oracle RAC instances. This can be used to split queries submitted to JDBC. The StorageHandler will depend on listener for load balancing.
Handling Logon Storms
Hadoop allows you to limit the number of mappers attempting to connect to the Database. Hadoop allows you to limit the number of mappers attempting to connect to the Database using oracle.hcat.osh.maxSplits
. This parameter controls the degree of concurrency. However, subsequent tasks of the same query are guaranteed to query their table granule as per the System Commit Number (SCN) of the query. This ensures consistency of the result sets.
Database Resident Connection Pooling (DRCP)
It is recommended to configure DRCP for OD4H, and limit the maximum number of concurrent connections to the Oracle Database from OD4H.
To configure DRCP, use the following steps:
Login as SYSDBA
.
Start the default pool, SYS_DEFAULT_CONNECTION_POOL
using DBMS_CONNECTION_POOL.START_POOL
with the default settings.
You can use DBMS_CONNECTION_POOL.MINSIZE
and DBMS_CONNECTION_POOL.MAXSIZE
with the default settings.
Note:
Oracle Database Administrator’s Guide for more information on configuring DRCP.Following are the security features of OD4H:
OD4H uses Oracle JDBC driver for connecting to Oracle Database. It provides all authentication methods supported by Oracle JDBC. OD4H supports authentication through use of basic authentication (username and password), Oracle Wallet, and Kerberos. You can specify the authentication to be used for a table created in Hive, through the oracle.hcat.osh.authentication
table property. This is useful only for strong authentication.
Kerberos
Oracle Wallet
Basic Authentication
Note:
Oracle recommends using strong authentication such as Kerberos.The various authentication processes are described with examples as follows:
Kerberos
Uses Kerberos credentials of the Hadoop engine process. This principal
should have access to the table.
See Also:
Oracle Database JDBC Developer's Guide for information on configuring database for Kerberos and details of client parameters
You can enable Kerberos configuration on Hive, by adding to hive-env.sh
the following:
export HADOOP_OPTS="$HADOOP_OPTS -Djava.security.krb5.conf=<path to kerberos configuration>
To enable child JVMs to use Kerberos configuration, edit the mapred-site.xml
to include the following property on all nodes of the cluster:
<property><name>mapred.child.java.opts</name> <value>-Djava.security.krb5.conf=<path to kerberos configuration>></value></property>
Enable these configurations on BDA using Cloudera manager..
Following is an illustration of Kerberos authentication:
CREATE EXTERNAL TABLE kerb_example ( id DECIMAL, name STRING, salary DECIMAL ) STORED BY 'oracle.hcat.osh.OracleStorageHandler' WITH SERDEPROPERTIES ( 'oracle.hcat.osh.columns.mapping' = 'id,name,salary') TBLPROPERTIES ( 'mapreduce.jdbc.url' = 'jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp) (HOST=adc*******.xxxxxx.com)(PORT=5521))(CONNECT_DATA= (SERVICE_NAME=project_name.xxx.rdbms.xxxx.com)))', 'mapreduce.jdbc.input.table.name' = 'kerb_example', 'mapreduce.jdbc.username' = 'CLIENT@xxxxxx.COM', 'oracle.hcat.osh.authentication' = 'KERBEROS', 'oracle.net.kerberos5_cc_name' = '/tmp/krb5cc_xxxxx', 'java.security.krb5.conf' = '/home/user/kerberos/krb5.conf', 'oracle.hcat.osh.kerb.callback' = 'KrbCallbackHandler', 'sun.security.krb5.principal' = 'CLIENT@xxxxx.COM' );
The path specified in oracle.security.krb5.conf
should be accessible to all nodes of the cluster. These paths should also match with the path of the corresponding properties in Oracle Database sqlnet.ora.
The keytab
path provided in sqlnet.ora
should also be accessible from all nodes of the cluster.
If sun.security.krb5.principal
is not specified, OD4H will attempt to authenticate using default principal in Credential Cache specified by the oracle.net.kerberos5_cc_name
property.
Note:
The callback
will be called only if the principal
cannot be authenticated using a ticket obtained from the credential cache specified in oracle.net.kerberos5_cc_nameproperty
.
A simple callback handler class is described as follows (The callback class must be available to the hive classpath):
class KrbCallbackHandler implements CallbackHandler{ @Override public void handle(Callback[] callbacks) throws IOException, UnsupportedCallbackException{ for (int i = 0; i < callbacks.length; i++){ if (callbacks[i] instanceof PasswordCallback){ PasswordCallback pc = (PasswordCallback)callbacks[i]; System.out.println("set password to 'welcome'"); pc.setPassword((new String("welcome")).toCharArray()); } else if (callbacks[i] instanceof NameCallback) { ((NameCallback)callbacks[i]).setName("client@xxxxx.COM"); }else{ throw new UnsupportedCallbackException(callbacks[i], "Unrecognized Callback"); } } }
Oracle Wallet
The wallet should be available in the OS environment of each engine process. Following is an illustration of how to add Wallet authentication:
CREATE EXTERNAL TABLE wallet_example ( id DECIMAL, name STRING, salary DECIMAL ) STORED BY 'oracle.hcat.osh.OracleStorageHandler' WITH SERDEPROPERTIES ( 'oracle.hcat.osh.columns.mapping' = 'id,name,salary') TBLPROPERTIES ( 'mapreduce.jdbc.url' = 'jdbc:oracle:thin:/@inst1', 'mapreduce.jdbc.input.table.name' = 'wallet_example', 'oracle.hcat.osh.authentication' = 'ORACLE_WALLET', 'oracle.net.tns_admin' = '/scratch/user/view_storage/user_project6/work', 'oracle.net.wallet_location' = '/scratch/user/view_storage/user_project6/work' );
Note:
The paths specified inoracle.net.tns_admin
and oracle.net.wallet_location
should be accessible from all nodes of the cluster.See Also:
Managing the Secure External Password Store for Password Credentials section in the Oracle Database Security Guide.
Basic Authentication (for demo purposes only)
This is stored in HCatalog TBLPROPERTIES
or supplied on HiveQL SELECT
statement.
When Basic Authentication is used, the username and password for Oracle Schema is specified in Hive external Table properties.
Note:
Oracle does not recommend this in the production environment, since the password is stored in clear in HCatalog.
HiveQL is a SQL like language provided by Hive. It can be used to query hive external tables created using OD4H.
You can run the Resource Manager web interface in your browser (http://bigdatalite.localdomain:8088/cluster
), to track the status of a running query on BDA.
You can also see the logs of a query in Cloudera Manager, which also indicates the actual query sent to Oracle Database corresponding to your query on HiveQL. Hive and OD4H use slf4j framework for logging. You can control logging level for OD4H related classes using logging configuration techniques of Hive.
Spark SQL enables relational queries expressed in SQL and HiveSQL to be executed using Spark. Spark SQL allows you to mix SQL queries with programmatic data manipulations supported by RDDs (Resilient Distributed Datasets) in Java, Python and Scala, with a single application.
Spark SQL enables you to submit relational queries using SQL or HiveQL. You can also use it to query external tables created using OD4H.
Perform the following steps to configure Spark-SQL on BigDataLite-4.2 VM, before running queries:
Add ojdbc7.jar
and osh.jar
to CLASSPATH in /usr/lib/spark/bin/compute-classpath.sh
CLASSPATH="$CLASSPATH:/opt/oracle/od4h/lib/osh.jar" CLASSPATH="$CLASSPATH:/opt/oracle/od4h/lib/ojdbc7.jar"
Edit SPARK_HOME
in /usr/lib/spark/conf/spark-env.sh
export SPARK_HOME=/usr/lib/spark:/etc/hive/conf
You will need to specify additional environment variables in /usr/lib/spark/conf/spark-env.sh
.
The Hive related variables that need to be added are marked in bold. The file already contains Hadoop related environment variables.
export DEFAULT_HADOOP=/usr/lib/hadoop export DEFAULT_HIVE=/usr/lib/hive export DEFAULT_HADOOP_CONF=/etc/hadoop/conf export DEFAULT_HIVE_CONF=/etc/hive/conf export HADOOP_HOME=${HADOOP_HOME:-$DEFAULT_HADOOP} export HADOOP_HDFS_HOME=${HADOOP_HDFS_HOME:-${HADOOP_HOME}/../hadoop-hdfs} export HADOOP_MAPRED_HOME=${HADOOP_MAPRED_HOME:-${HADOOP_HOME}/../hadoop-mapreduce} export HADOOP_YARN_HOME=${HADOOP_YARN_HOME:-${HADOOP_HOME}/../hadoop-yarn} export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-$DEFAULT_HADOOP_CONF} export HIVE_CONF_DIR=${HIVE_CONF_DIR:-$DEFAULT_HIVE_CONF} CLASSPATH="$CLASSPATH:$HIVE_CONF_DIR" CLASSPATH="$CLASSPATH:$HADOOP_CONF_DIR" if [ "x" != "x$YARN_CONF_DIR" ]; then CLASSPATH="$CLASSPATH:$YARN_CONF_DIR" fi # Let's make sure that all needed hadoop libs are added properly CLASSPATH="$CLASSPATH:$HADOOP_HOME/client/*" CLASSPATH="$CLASSPATH:$HIVE_HOME/lib/*" CLASSPATH="$CLASSPATH:$($HADOOP_HOME/bin/hadoop classpath)"
Once configured, you can run some sample queries on spark SQL using scripts included in demo:/shell/*QuerySpark.sh
. By default, Spark prints queries on the console. To modify this behavior you can edit the spark logging configuration file /usr/lib/spark/conf/log4j.properties
.
The log printed by OracleRecordReader shows the actual query sent to Oracle Database, as follows:
15/03/18 10:36:08 INFO
OracleRecordReader: Reading records from Oracle Table using Query: SELECT
FIRST_NAME
, LAST_NAME
, EMP_ID
FROM
EmployeeData
In the typical use case for OD4H, you store the result sets of Hive or Spark SQL queries back to Oracle Database. OD4H implements OutputFormat to enable you to write back to an Oracle Database table from Hadoop.
After the data is inserted into an Oracle Database table, you can then use your favorite business intelligence tools for further data mining
The following query is from the OD4H demo code samples. It demonstrates writing back to an external table called EmployeeBonusReport.
Example 9-1 Writing Hive or Spark Result Sets Back to Oracle Database
INSERT INTO EmployeeBonusReport SELECT EmployeeDataSimple.First_Name, EmployeeDataSimple.Last_Name, EmployeeBonus.bonus FROM EmployeeDataSimple JOIN EmployeeBonus ON (EmployeeDataSimple.Emp_ID=EmployeeBonus.Emp_ID) WHERE salary > 70000 and bonus > 7000"