6 Oracle DataSource for Apache Hadoop (OD4H)
6.1 Operational Data, Big Data and Requirements
The typical data architecture in most companies comprises the following components:
- Oracle Database for operational, transactional, and master data, which is the shared business object
- Big Data
Hadoop applications (such as Master Data Management (MDM) or event processing) need access to data in both Hadoop stores (such as HDFS and NoSQL Database, as a landing point for web logs) and Oracle Database (as the reliable and auditable source of truth). You can adopt one of the following approaches to process data that resides in both Hadoop storage and Oracle Database:
- Perform an Extract, Transform, and Load (ETL) copy using tools such as the Copy to Hadoop tool from Oracle
- Access the data directly using Oracle Big Data SQL and Oracle DataSource for Apache Hadoop (OD4H)
This chapter discusses Oracle DataSource for Apache Hadoop (OD4H).
6.2 Overview of Oracle DataSource for Apache Hadoop (OD4H)
Oracle DataSource for Apache Hadoop (OD4H) is the storage handler for Oracle Database that uses HCatalog and InputFormat.
This section discusses the following concepts:
6.2.1 Opportunity with Hadoop 2.x
The Hadoop 2.x architecture decouples compute engines from cluster resource management and storage. It enables:
- A variety of SQL query engines, such as Hive SQL, Spark SQL, and Big Data SQL.
- A variety of programmatic compute engines, such as MapReduce, Pig, Storm, Solr, and Cascading.
- Elastic allocation of compute resources (CPU, memory) through YARN.
- A variety of data stores such as HDFS and NoSQL, as well as remote storage, through the HCatalog, InputFormat, OutputFormat, and StorageHandler interfaces.
Oracle DataSource for Apache Hadoop (OD4H) is the storage handler for Oracle Database that uses HCatalog and InputFormat.
Figure: Hadoop 2.x architecture.
6.2.2 Oracle Tables as Hadoop Data Source
OD4H enables current and ad hoc querying. This makes querying data faster and more secure: you query the data directly and retrieve only the data that you need, when you need it.
OD4H also provides Oracle's end-to-end security, including Identity Management, Column Masking, and Label and Row Security.
OD4H also provides direct access to Hadoop and Spark APIs such as Pig, MapReduce, and others.
6.2.3 External Tables
External Tables turn Oracle tables into Hadoop and/or Spark datasources. The DDL for declaring External Tables is as follows:
CREATE [TEMPORARY] EXTERNAL TABLE [IF NOT EXISTS] [db_name.]table_name
  [(col_name data_type [COMMENT col_comment], ...)]
  [COMMENT table_comment]
  STORED BY 'oracle.hcat.osh.OracleStorageHandler' [WITH SERDEPROPERTIES (...)]
  [TBLPROPERTIES (property_name=property_value, ...)]
data_type
  : SMALLINT
  | INT
  | BIGINT
  | BOOLEAN
  | FLOAT
  | DOUBLE
  | STRING
  | BINARY
  | TIMESTAMP
  | DECIMAL
  | DECIMAL(precision,scale)
  | VARCHAR
  | CHAR

See Also:
Refer to the following link for Hive external table syntax: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTable

Note:
Oracle supports only primitive types.
| Oracle Data Type | Hive Data Type |
|---|---|
| NUMBER | INT when the scale is 0 and the precision is less than 10. BIGINT when the scale is 0 and the precision is less than 19. DECIMAL when the scale is greater than 0 or the precision is greater than 19. |
| CLOB, NCLOB | STRING |
| BINARY_DOUBLE | DOUBLE |
| BINARY_FLOAT | FLOAT |
| BLOB | BINARY |
| CHAR, NCHAR | CHAR |
| VARCHAR2, NVARCHAR2 | VARCHAR |
| ROWID, UROWID | BINARY |
| DATE | TIMESTAMP |
| TIMESTAMP | TIMESTAMP |
| TIMESTAMP WITH TIME ZONE, TIMESTAMP WITH LOCAL TIME ZONE | Unsupported |
| RAW | BINARY |
The properties of external tables can be described as follows:
6.2.3.1 TBLPROPERTIES
| Property | Use |
|---|---|
| oracle.hcat.osh.columns.mapping | Comma-separated list that specifies the mapping between Hive columns and Oracle table columns. All external tables using OracleStorageHandler must define this. |
| mapreduce.jdbc.url | Connection URL to connect to the database. |
| mapreduce.jdbc.username | Connection user name to connect to the database. |
| mapreduce.jdbc.password | Connection password to connect to the database. |
| mapreduce.jdbc.input.table.name | Oracle table name. |
| mapreduce.jdbc.input.conditions | Conditions to be used when querying the database. Must be used for query pushdown. |
| mapreduce.jdbc.input.query | Query to be used when querying the database. Use a query only when a subset of the columns is selected. |
| mapreduce.jdbc.input.orderby | ORDER BY clause to be used when querying the database. |
| oracle.hcat.osh.splitterKind | Specifies how OracleStorageHandler must create splits, so that they are a good match for the physical structure of the target table in Oracle Database. The applicable splitter kinds are SINGLE_SPLITTER, ROW_SPLITTER, BLOCK_SPLITTER, PARTITION_SPLITTER, and CUSTOM_SPLITTER. |
| oracle.hcat.osh.rowsPerSplit | Used only when ROW_SPLITTER is specified; sets the number of rows per split. The default is 1000. |
| oracle.hcat.osh.authentication | Authentication method used to connect to Oracle Database. Can be KERBEROS, ORACLE_WALLET, or basic (user name and password) authentication. |
| sun.security.krb5.principal | Kerberos principal. Used only when Kerberos authentication is applied. |
| oracle.hcat.osh.kerb.callback | Callback for Kerberos authentication. Used only when Kerberos authentication is applied. |
| oracle.hcat.osh.maxSplits | Maximum number of splits for any splitter kind. |
| oracle.hcat.osh.useChunkSplitter | Use chunk-based ROW_SPLITTER and BLOCK_SPLITTER, which use the DBMS_PARALLEL_EXECUTE package to divide the table into chunks that map to Hadoop splits. The default value is 'true'. |
| oracle.hcat.osh.chunkSQL | Used by CUSTOM_SPLITTER to create splits. The SQL string must be a SELECT statement that returns the range of each chunk as two columns, start_id and end_id, both of ROWID type. |
| oracle.hcat.osh.useOracleParallelism | When configured, parallel queries are executed while fetching rows from Oracle. Default value: 'false'. |
| oracle.hcat.osh.fetchSize | JDBC fetch size for the generated SELECT queries used to fetch rows. Default value: 10 (set by the Oracle JDBC driver). |
Note:
In addition to the above, any JDBC connection properties (oracle.jdbc.* and oracle.net.*) can be specified as TBLPROPERTIES. They are used while establishing the connection to Oracle Database using the JDBC driver.
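For example, the following is a minimal sketch based on the EMPLOYEES example used later in this chapter; the Hive table name employees_with_jdbc_props and the two timeout values are illustrative assumptions, not OD4H defaults. The last two properties are standard Oracle JDBC connection properties passed through unchanged.
CREATE EXTERNAL TABLE employees_with_jdbc_props (
EMPLOYEE_ID INT,
FIRST_NAME STRING,
LAST_NAME STRING
)
STORED BY 'oracle.hcat.osh.OracleStorageHandler'
WITH SERDEPROPERTIES (
'oracle.hcat.osh.columns.mapping' = 'employee_id,first_name,last_name')
TBLPROPERTIES (
'mapreduce.jdbc.url' = 'jdbc:oracle:thin:@localhost:1521:orcl',
'mapreduce.jdbc.username' = 'hr',
'mapreduce.jdbc.password' = 'hr',
'mapreduce.jdbc.input.table.name' = 'EMPLOYEES',
'oracle.jdbc.ReadTimeout' = '60000',
'oracle.net.CONNECT_TIMEOUT' = '10000'
);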
Note:
Oracle DataSource for Apache Hadoop (OD4H) works with Oracle views and Oracle tables.
6.2.3.2 SERDE PROPERTIES
| Property | Use |
|---|---|
| oracle.hcat.osh.columns.mapping | All external tables using OracleStorageHandler must define this. It is a comma-separated list that specifies the mapping between Hive columns (specified in CREATE TABLE) and Oracle table columns. |
6.2.4 List of jars in the OD4H package
Oracle DataSource for Apache Hadoop (OD4H) contains the following JAR files.
Table 6-1 List of jars in OD4H
| Name of JAR | Use |
|---|---|
| osh.jar | Contains OracleStorageHandler Implementation |
| ojdbc8.jar | An OD4H specific JDBC driver (which is optimized with internal calls), used by Spark or Hadoop tasks to connect to the database. |
| ucp.jar | For creating connection pools in OracleStorageHandler |
| oraclepki103.jar, osdt_core.jar, osdt_cert.jar, osdt_jce.jar | For Oracle Wallet authentication |
| orai18n.jar | Oracle Globalization Support |
| xdb.jar | Oracle XDB jar |
6.3 How does OD4H work?
Oracle DataSource for Apache Hadoop (OD4H) does not require creating a new table. You can start working with OD4H using the following steps:
- Create a new Oracle table, or reuse an existing table.
- Create the Hive DDL for creating the external table referencing the Oracle table.
- Issue HiveSQL, SparkSQL, or other Spark/Hadoop queries and API calls.
The following sections show how to create a new Oracle Database Table, and a Hive DDL:
6.3.1 Create a new Oracle Database Table or Reuse an Existing Table
Here is an illustration of a partitioned Oracle table that we will use to demonstrate how partition pruning works:
CREATE TABLE EmployeeData (
  Emp_ID     NUMBER,
  First_Name VARCHAR2(20),
  Last_Name  VARCHAR2(20),
  Job_Title  VARCHAR2(40),
  Salary     NUMBER
)
PARTITION BY RANGE (Salary)
(
  PARTITION salary_1 VALUES LESS THAN (60000)  TABLESPACE tsa,
  PARTITION salary_2 VALUES LESS THAN (70000)  TABLESPACE tsb,
  PARTITION salary_3 VALUES LESS THAN (80000)  TABLESPACE tsc,
  PARTITION salary_4 VALUES LESS THAN (90000)  TABLESPACE tsd,
  PARTITION salary_5 VALUES LESS THAN (100000) TABLESPACE tse
);

Note:
You can use this syntax for table creation in the examples that follow in this chapter.
- Issue queries from Hive, Spark, or any other Hadoop model (including joins with local Hive tables).
6.3.2 Hive DDL
In this example, we will associate two Hive external tables with the same Oracle table, using two different split patterns:
- SINGLE_SPLITTER
- PARTITION_SPLITTER

Note:
The external table can have fewer columns than the base Oracle table. Since columns can have different names, use TBLPROPERTY for mapping with the base table. A sketch of such a subset mapping follows this note.
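The following is a minimal sketch of a subset mapping; the Hive table name EmployeeNamesOnly is hypothetical, it uses the connection variables introduced just below, and it exposes only two columns of the EmployeeData table defined in the previous section.
CREATE EXTERNAL TABLE EmployeeNamesOnly (
Emp_ID int,
Last_Name string
)
STORED BY 'oracle.hcat.osh.OracleStorageHandler'
WITH SERDEPROPERTIES (
'oracle.hcat.osh.columns.mapping' = 'Emp_ID,Last_Name')
TBLPROPERTIES (
'mapreduce.jdbc.url' = '${hiveconf:connection_string}',
'mapreduce.jdbc.username' = '${hiveconf:oracle_user}',
'mapreduce.jdbc.password' = '${hiveconf:oracle_pwd}',
'mapreduce.jdbc.input.table.name' = 'EmployeeData'
);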
In the following examples, we are using the following variables:
connection_string = jdbc:oracle:thin:@localhost:1521/<servicename>
oracle_user=od4h
oracle_pwd=od4h
The following command creates a Hive external table with the default split pattern, that is, SINGLE_SPLITTER.
CREATE EXTERNAL TABLE EmployeeDataSimple (
Emp_ID int,
First_Name string,
Last_Name string,
Job_Title string,
Salary int
)
STORED BY 'oracle.hcat.osh.OracleStorageHandler'
WITH SERDEPROPERTIES (
'oracle.hcat.osh.columns.mapping' = 'Emp_ID,First_Name,Last_Name,Job_Title,Salary')
TBLPROPERTIES (
'mapreduce.jdbc.url' = '${hiveconf:connection_string}',
'mapreduce.jdbc.username' = '${hiveconf:oracle_user}',
'mapreduce.jdbc.password' = '${hiveconf:oracle_pwd}',
'mapreduce.jdbc.input.table.name' = 'EmployeeData'
);

The following example creates a Hive external table using PARTITION_SPLITTER.
DROP TABLE EmployeeDataPartitioned;
CREATE EXTERNAL TABLE EmployeeDataPartitioned (
Emp_ID int,
First_Name string,
Last_Name string,
Job_Title string,
Salary int
)
STORED BY 'oracle.hcat.osh.OracleStorageHandler'
WITH SERDEPROPERTIES (
'oracle.hcat.osh.columns.mapping' = 'Emp_ID,First_Name,Last_Name,Job_Title,Salary')
TBLPROPERTIES (
'mapreduce.jdbc.url' = '${hiveconf:connection_string}',
'mapreduce.jdbc.username' = '${hiveconf:oracle_user}',
'mapreduce.jdbc.password' = '${hiveconf:oracle_pwd}',
'mapreduce.jdbc.input.table.name' = 'EmployeeData',
'oracle.hcat.osh.splitterKind' = 'PARTITIONED_TABLE'
);

6.3.3 Create External Tables in Hive
You can create an external table in Hive in the following way:
DROP TABLE employees;
CREATE EXTERNAL TABLE employees (
EMPLOYEE_ID INT,
FIRST_NAME STRING,
LAST_NAME STRING,
SALARY DOUBLE,
HIRE_DATE TIMESTAMP,
JOB_ID STRING
)
STORED BY 'oracle.hcat.osh.OracleStorageHandler'
WITH SERDEPROPERTIES (
'oracle.hcat.osh.columns.mapping' = 'employee_id,first_name,last_name,salary,hire_date,job_id')
TBLPROPERTIES (
'mapreduce.jdbc.url' = 'jdbc:oracle:thin:@localhost:1521:orcl',
'mapreduce.jdbc.username' = 'hr',
'mapreduce.jdbc.password' = 'hr',
'mapreduce.jdbc.input.table.name' = 'EMPLOYEES'
);

Note:
To use OD4H, ensure that the ucp.jar, ojdbc8.jar, and osh.jar files are present in the Hive CLASSPATH environment variable. This is preconfigured on Oracle Big Data Appliance.
To learn more about the CLASSPATH environment variable and other Hive configuration properties, refer to the following sources:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Cli
https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties
For Cloudera distribution, refer to the following page that explains the steps for Cloudera Manager:
https://www.cloudera.com/documentation/enterprise/5-14-x/topics/cm_mc_hive_udf.html
For other distributions, refer to the respective documentation on adding additional JAR files to Hive or HiveServer2 environment.
6.4 Features of OD4H
The following topics discuss features of OD4H.
6.4.1 Performance And Scalability Features
Following sections discuss the performance and scalability features of OD4H:
HCatalog stores the table metadata from the Hive DDL. Hive SQL, Spark SQL, and other engines then use this metadata when submitting queries.
The Oracle table is divided into granules determined by the splitterKind property. These granules are then read into a split by OracleStorageHandler, by submitting generated queries.
If the query plan determines which splits need to be scanned, OracleStorageHandler does not have to test all possible query types.
6.4.1.1 Splitters
While executing a query on a Hive external table through OD4H, the underlying Oracle table is dynamically divided into granules, which correspond to splits on the Hadoop side. Each split is processed by a single map task. With the oracle.hcat.osh.splitterKind property, you can specify how the splits are created. This ensures that the splits are a good match for the physical structure of the target table in Oracle Database.
The different kinds of splitters available are:
SINGLE_SPLITTER
Creates one split for the table. Use SINGLE_SPLITTER where a single task is sufficient to process the query against the entire table.
ROW_SPLITTER
Limits the number of rows per split. The default number of rows is 1000. You can specify the number of rows by setting the oracle.hcat.osh.rowsPerSplit property. The default value of oracle.hcat.osh.maxSplits is 1 when ROW_SPLITTER is used. You can increase this value to enable parallel reads.
Based on the rowsPerSplit property, OD4H divides the table into splits. If the number of splits obtained is higher than maxSplits, then the maxSplits property is honored and the rows per split are adjusted accordingly.
Note:
oracle.hcat.osh.rowsPerSplit is used only by ROW_SPLITTER and not by any other splitter kind.
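For example (illustrative numbers): with oracle.hcat.osh.rowsPerSplit set to 1000 on a table of 1,000,000 rows, OD4H would derive 1000 candidate splits; if oracle.hcat.osh.maxSplits is 100, only 100 splits are created and each split then covers roughly 10,000 rows.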
BLOCK_SPLITTER
Creates splits based on the underlying storage data blocks of the table. The default value of oracle.hcat.osh.maxSplits is 1 when BLOCK_SPLITTER is used. You can increase this value to enable parallel reads. BLOCK_SPLITTER requires the SELECT privilege on the SYS.DBA_EXTENTS view, granted to the schema containing the Oracle target table. If this privilege does not exist, OD4H uses SINGLE_SPLITTER.

Note:
The actual number of splits under BLOCK_SPLITTER may be less than the value specified in the oracle.hcat.osh.maxSplits property.
Do not use BLOCK_SPLITTER on partitioned tables or index-organized tables.
Note:
For the ROW_SPLITTER and BLOCK_SPLITTER kinds, use oracle.hcat.osh.useChunkSplitter to specify the splitting mechanism. The default property value is true. This enables creating chunks corresponding to splits using the DBMS_PARALLEL_EXECUTE package. When the property value is false, custom SQL is generated for splitting. Because DBMS_PARALLEL_EXECUTE can only be used for tables and not views, if mapreduce.jdbc.input.table.name points to a view and not a table, then oracle.hcat.osh.useChunkSplitter should be set to false (see the sketch after this note).
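The following is a minimal sketch of that case, assuming a hypothetical Oracle view named EMPLOYEE_DATA_VW over the EmployeeData table; the choice of ROW_SPLITTER here is illustrative, and only the last property differs in purpose from the earlier examples.
CREATE EXTERNAL TABLE EmployeeDataFromView (
Emp_ID int,
First_Name string,
Last_Name string,
Job_Title string,
Salary int
)
STORED BY 'oracle.hcat.osh.OracleStorageHandler'
WITH SERDEPROPERTIES (
'oracle.hcat.osh.columns.mapping' = 'Emp_ID,First_Name,Last_Name,Job_Title,Salary')
TBLPROPERTIES (
'mapreduce.jdbc.url' = '${hiveconf:connection_string}',
'mapreduce.jdbc.username' = '${hiveconf:oracle_user}',
'mapreduce.jdbc.password' = '${hiveconf:oracle_pwd}',
'mapreduce.jdbc.input.table.name' = 'EMPLOYEE_DATA_VW',
'oracle.hcat.osh.splitterKind' = 'ROW_SPLITTER',
'oracle.hcat.osh.useChunkSplitter' = 'false'
);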
PARTITION_SPLITTER
Creates one split per partition. PARTITION_SPLITTER is used by default when the table is partitioned. You can override this setting by specifying ROW_SPLITTER in table properties. With PARTITION_SPLITTER, the default value of oracle.hcat.osh.maxSplits table property is 64.
Following is an illustration of ROW_SPLITTER:
DROP TABLE employees;
CREATE EXTERNAL TABLE employees (
EMPLOYEE_ID INT,
FIRST_NAME STRING,
LAST_NAME STRING,
SALARY DOUBLE,
HIRE_DATE TIMESTAMP,
JOB_ID STRING
)
STORED BY 'oracle.hcat.osh.OracleStorageHandler'
WITH SERDEPROPERTIES (
'oracle.hcat.osh.columns.mapping' = 'employee_id,first_name,last_name,salary,hire_date,job_id')
TBLPROPERTIES (
'mapreduce.jdbc.url' = 'jdbc:oracle:thin:@localhost:1521:orcl',
'mapreduce.jdbc.username' = 'hr',
'mapreduce.jdbc.password' = 'hr',
'mapreduce.jdbc.input.table.name' = 'EMPLOYEES',
'oracle.hcat.osh.splitterKind' = 'ROW_SPLITTER',
'oracle.hcat.osh.rowsPerSplit' = '1500'
);

CUSTOM_SPLITTER
Use CUSTOM_SPLITTER if you want to provide a custom split generation mechanism. You can do this by specifying CUSTOM_SPLITTER through the oracle.hcat.osh.splitterKind property, together with a SELECT statement in oracle.hcat.osh.chunkSQL that emits the ROWIDs corresponding to the start and end of each split. A sketch follows.
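A minimal sketch, reusing the EmployeeData table and connection variables from the earlier examples; the external table name EmployeeDataCustom is hypothetical, and the NTILE-based chunk query is only one illustrative way to produce non-overlapping ROWID ranges named start_id and end_id (here, four chunks).
CREATE EXTERNAL TABLE EmployeeDataCustom (
Emp_ID int,
First_Name string,
Last_Name string,
Job_Title string,
Salary int
)
STORED BY 'oracle.hcat.osh.OracleStorageHandler'
WITH SERDEPROPERTIES (
'oracle.hcat.osh.columns.mapping' = 'Emp_ID,First_Name,Last_Name,Job_Title,Salary')
TBLPROPERTIES (
'mapreduce.jdbc.url' = '${hiveconf:connection_string}',
'mapreduce.jdbc.username' = '${hiveconf:oracle_user}',
'mapreduce.jdbc.password' = '${hiveconf:oracle_pwd}',
'mapreduce.jdbc.input.table.name' = 'EmployeeData',
'oracle.hcat.osh.splitterKind' = 'CUSTOM_SPLITTER',
'oracle.hcat.osh.chunkSQL' = 'SELECT MIN(rid) start_id, MAX(rid) end_id FROM (SELECT rowid rid, NTILE(4) OVER (ORDER BY rowid) bucket FROM EmployeeData) GROUP BY bucket'
);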
6.4.1.2 Choosing a Splitter
SINGLE_SPLITTER is used by default if no splitter is specified in the table properties for the Hive external table and the target Oracle table is not partitioned.
For an unpartitioned table, the default value of oracle.hcat.osh.maxSplits is 1. For a partitioned table, the default value is 64 and the default splitter is PARTITION_SPLITTER. The default for maxSplits is set to limit the number of connections to the Oracle server. To increase this limit, you must increase the value of oracle.hcat.osh.maxSplits explicitly in the Hive table properties.
Use the following guidelines while choosing a splitter kind for a Hive external table:
| Splitter Kind | Use |
|---|---|
| SINGLE_SPLITTER | When no parallelism is required. |
| PARTITION_SPLITTER | Used by default when the target table is partitioned. |
| BLOCK_SPLITTER | When the Oracle user has the SELECT privilege on SYS.DBA_EXTENTS and the target table is not partitioned. |
| ROW_SPLITTER | When the Oracle user does not have the SELECT privilege on SYS.DBA_EXTENTS. |
| CUSTOM_SPLITTER | For fine-grained control over generated splits. |
6.4.1.3 Predicate Pushdown
Predicate Pushdown is an optimization technique, in which you push predicates (WHERE condition) down to be evaluated by Oracle Database at the time of querying. This minimizes the amount of data fetched from Oracle Database to Hive, while performing a query.
Set the configuration property hive.optimize.ppd to true to enable Predicate Pushdown, or to false to disable it. The default value in Hive 1.1.0 is true, so Predicate Pushdown is always performed unless you explicitly disable it.
Note:
OD4H does not push down all possible predicates. It considers only the part of the execution plan pertaining to the Oracle table declared as an external table. OD4H also rewrites sub-queries for the Oracle SQL engine and each split task. At present, conditions involving the operators >, =, <, and != in a single condition over a column (for example, key > 10), or a combination of multiple conditions separated by AND (for example, key > 10 AND key < 20 AND key != 17), are pushed down.
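For example, the following is a minimal sketch against the EmployeeDataSimple table defined earlier; with hive.optimize.ppd enabled, both Salary conditions are pushed down and evaluated by Oracle Database rather than by Hive:
SET hive.optimize.ppd=true;
SELECT Emp_ID, Last_Name, Salary
FROM EmployeeDataSimple
WHERE Salary > 70000 AND Salary < 90000;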
Another option to reduce the amount of data fetched from the Oracle Database is to specify a condition at the time of table creation, using TBLPROPERTY mapreduce.jdbc.input.conditions. For instance:
mapreduce.jdbc.input.conditions = 'key > 10 OR key = 0'.
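A minimal sketch of that table-level condition, reusing the EmployeeData table and the connection variables from the earlier examples; the Hive table name EmployeeDataFiltered and the Salary condition are illustrative only.
CREATE EXTERNAL TABLE EmployeeDataFiltered (
Emp_ID int,
First_Name string,
Last_Name string,
Job_Title string,
Salary int
)
STORED BY 'oracle.hcat.osh.OracleStorageHandler'
WITH SERDEPROPERTIES (
'oracle.hcat.osh.columns.mapping' = 'Emp_ID,First_Name,Last_Name,Job_Title,Salary')
TBLPROPERTIES (
'mapreduce.jdbc.url' = '${hiveconf:connection_string}',
'mapreduce.jdbc.username' = '${hiveconf:oracle_user}',
'mapreduce.jdbc.password' = '${hiveconf:oracle_pwd}',
'mapreduce.jdbc.input.table.name' = 'EmployeeData',
'mapreduce.jdbc.input.conditions' = 'Salary > 70000'
);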
Note:
Due to incompatibilities between the date and timestamp representations in Hive and Oracle, these columns are not pushed down by default in a query. You can enable pushdown, with certain limitations, by setting the table property oracle.hcat.datetime.pushdown to true. When set to true, the date representation in the query must be in the form YYYY-MM-DD and the timestamp representation must be in the form "YYYY-MM-DD HH:MM:SS" with no decimal places. No other date or timestamp representation is supported when oracle.hcat.datetime.pushdown is set to true.
6.4.1.4 Projection Pushdown
Projection Pushdown is an optimization technique that fetches only the required columns from Oracle Database when a query is performed. If you want to fetch all columns during a query (not recommended), you can disable it by setting the hive.io.file.read.all.columns connection property to true. In Hive 1.1.0, this property is false by default.
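For example, the following sketch against the EmployeeDataSimple table defined earlier fetches only the two projected columns from Oracle Database; the explicit SET simply makes the default behavior visible:
SET hive.io.file.read.all.columns=false;
SELECT First_Name, Salary
FROM EmployeeDataSimple;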
6.4.1.5 Partition Pruning
If you refer to the EmployeeDataPartitioned table, the partitions irrelevant to the query are removed from the partition list. This is done by executing an explain plan on the query to obtain the list of partitions and sub-partitions that are relevant to the query.
Table-level partition pruning uses table-level predicate pushdown, whereas partition pruning at the query level uses query-level predicate pushdown.
Partition pruning is active when a SELECT query is run in which the WHERE clause uses the partitioning key. Following is an example of partition pruning: to query only the partition where salary is between 72000 and 78000, and prune the other partitions, create the following Hive external table and run the SELECT statement shown after it.
Hive External Table:
CREATE EXTERNAL TABLE EmployeeDataPartitioned (
Emp_ID int,
First_Name string,
Last_Name string,
Job_Title string,
Salary int
)
STORED BY 'oracle.hcat.osh.OracleStorageHandler'
WITH SERDEPROPERTIES (
'oracle.hcat.osh.columns.mapping' = 'Emp_ID,First_Name,Last_Name,Job_Title,Salary')
TBLPROPERTIES (
'mapreduce.jdbc.url' = '${hiveconf:connection_string}',
'mapreduce.jdbc.username' = '${hiveconf:oracle_user}',
'mapreduce.jdbc.password' = '${hiveconf:oracle_pwd}',
'mapreduce.jdbc.input.table.name' = 'EmployeeData',
'oracle.hcat.osh.splitterKind' = 'PARTITIONED_TABLE'
);
The following SELECT statement shows how to query the partition where salary is between 72000 and 78000, and prunes the other partitions:
select * from EmployeeDataPartitioned where salary > 72000 and salary < 78000;
6.4.2 Smart Connection Management
Connection Caching
Each map task runs in its own JVM. Each JVM in turn caches a single connection to the Oracle database, which can be reused within the same query. The mapper checks the cache before establishing a new connection, and the connection is not cached after the query completes.
Oracle RAC Awareness
JDBC and UCP are aware of the various Oracle RAC instances. This awareness can be used to split queries submitted to JDBC. The StorageHandler relies on the listener for load balancing.
Handling Logon Storms
Hadoop allows you to limit the number of mappers attempting to connect to the database by using oracle.hcat.osh.maxSplits. This parameter controls the degree of concurrency. However, subsequent tasks of the same query are guaranteed to query their table granule as per the System Change Number (SCN) of the query. This ensures consistency of the result sets.
Database Resident Connection Pooling (DRCP)
It is recommended to configure DRCP for OD4H, and limit the maximum number of concurrent connections to the Oracle Database from OD4H.
To configure DRCP, use the following steps:
- Log in as SYSDBA.
- Start the default pool, SYS_DEFAULT_CONNECTION_POOL, using DBMS_CONNECTION_POOL.START_POOL with the default settings.
  You can use DBMS_CONNECTION_POOL.MINSIZE and DBMS_CONNECTION_POOL.MAXSIZE with the default settings.

A sketch of these steps follows.
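A minimal sketch, run from SQL*Plus as SYSDBA; the MAXSIZE value shown is illustrative only, and by default you can simply start the pool and keep its default settings.
-- Start the default connection pool with its default settings.
EXEC DBMS_CONNECTION_POOL.START_POOL();

-- Optional, illustrative: cap the pool size to limit concurrent OD4H connections.
EXEC DBMS_CONNECTION_POOL.ALTER_PARAM('SYS_DEFAULT_CONNECTION_POOL', 'MAXSIZE', '40');

Clients then request pooled sessions by including (SERVER=POOLED) in the CONNECT_DATA section of the JDBC connect descriptor.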
See Also:
For more information on configuring DRCP, see the Oracle Database Administration Guide.

6.4.3 Security Features
Following are the security features of OD4H:
6.4.3.1 Improved Authentication
OD4H uses the Oracle JDBC driver for connecting to Oracle Database. It provides all the authentication methods supported by Oracle JDBC. OD4H supports authentication through basic authentication (user name and password), Oracle Wallet, and Kerberos. You can specify the authentication to be used for a table created in Hive through the oracle.hcat.osh.authentication table property. This is useful only for strong authentication.
OD4H supports the following authentication methods:
- Kerberos
- Oracle Wallet
- Basic Authentication
Note:
Oracle recommends using strong authentication, such as Kerberos.

The various authentication processes are described with examples as follows:
- Kerberos
Uses the Kerberos credentials of the Hadoop engine process. This principal should have access to the table.

See Also:
Oracle Database JDBC Developer's Guide for information on configuring the database for Kerberos and details of the client parameters.

You can enable the Kerberos configuration on Hive by adding the following to hive-env.sh:

export HADOOP_OPTS="$HADOOP_OPTS -Djava.security.krb5.conf=<path to kerberos configuration>"

To enable child JVMs to use the Kerberos configuration, edit mapred-site.xml to include the following property on all nodes of the cluster:

<property>
  <name>mapred.child.java.opts</name>
  <value>-Djava.security.krb5.conf=<path to kerberos configuration></value>
</property>

Enable these configurations on Oracle Big Data Appliance using Cloudera Manager.
Following is an illustration of Kerberos authentication:
CREATE EXTERNAL TABLE kerb_example (
id DECIMAL,
name STRING,
salary DECIMAL
)
STORED BY 'oracle.hcat.osh.OracleStorageHandler'
WITH SERDEPROPERTIES (
'oracle.hcat.osh.columns.mapping' = 'id,name,salary')
TBLPROPERTIES (
'mapreduce.jdbc.url' = 'jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=adc*******.xxxxxx.com)(PORT=5521))(CONNECT_DATA=(SERVICE_NAME=project_name.xxx.rdbms.xxxx.com)))',
'mapreduce.jdbc.input.table.name' = 'kerb_example',
'mapreduce.jdbc.username' = 'CLIENT@xxxxxx.COM',
'oracle.hcat.osh.authentication' = 'KERBEROS',
'oracle.net.kerberos5_cc_name' = '/tmp/krb5cc_xxxxx',
'java.security.krb5.conf' = '/home/user/kerberos/krb5.conf',
'oracle.hcat.osh.kerb.callback' = 'KrbCallbackHandler',
'sun.security.krb5.principal' = 'CLIENT@xxxxx.COM'
);

The path specified in java.security.krb5.conf should be accessible to all nodes of the cluster. These paths should also match the paths of the corresponding properties in the Oracle Database sqlnet.ora. The keytab path provided in sqlnet.ora should also be accessible from all nodes of the cluster.

If sun.security.krb5.principal is not specified, OD4H attempts to authenticate using the default principal in the credential cache specified by the oracle.net.kerberos5_cc_name property.

Note:
The callback is called only if the principal cannot be authenticated using a ticket obtained from the credential cache specified in the oracle.net.kerberos5_cc_name property.

A simple callback handler class is described as follows (the callback class must be available in the Hive classpath):
import java.io.IOException;
import javax.security.auth.callback.Callback;
import javax.security.auth.callback.CallbackHandler;
import javax.security.auth.callback.NameCallback;
import javax.security.auth.callback.PasswordCallback;
import javax.security.auth.callback.UnsupportedCallbackException;

class KrbCallbackHandler implements CallbackHandler {

    @Override
    public void handle(Callback[] callbacks) throws IOException, UnsupportedCallbackException {
        for (int i = 0; i < callbacks.length; i++) {
            if (callbacks[i] instanceof PasswordCallback) {
                // Supply the password for the Kerberos principal.
                PasswordCallback pc = (PasswordCallback) callbacks[i];
                System.out.println("set password to 'welcome'");
                pc.setPassword((new String("welcome")).toCharArray());
            } else if (callbacks[i] instanceof NameCallback) {
                // Supply the principal name.
                ((NameCallback) callbacks[i]).setName("client@xxxxx.COM");
            } else {
                throw new UnsupportedCallbackException(callbacks[i], "Unrecognized Callback");
            }
        }
    }
}

- Oracle Wallet
The wallet should be available in the OS environment of each engine process. Following is an illustration of how to add Wallet authentication:
CREATE EXTERNAL TABLE wallet_example (
id DECIMAL,
name STRING,
salary DECIMAL
)
STORED BY 'oracle.hcat.osh.OracleStorageHandler'
WITH SERDEPROPERTIES (
'oracle.hcat.osh.columns.mapping' = 'id,name,salary')
TBLPROPERTIES (
'mapreduce.jdbc.url' = 'jdbc:oracle:thin:/@inst1',
'mapreduce.jdbc.input.table.name' = 'wallet_example',
'oracle.hcat.osh.authentication' = 'ORACLE_WALLET',
'oracle.net.tns_admin' = '/scratch/user/view_storage/user_project6/work',
'oracle.net.wallet_location' = '/scratch/user/view_storage/user_project6/work'
);

Note:
The paths specified in oracle.net.tns_admin and oracle.net.wallet_location should be accessible from all nodes of the cluster.

- Basic Authentication (for demonstration purposes only)
This is stored in HCatalog TBLPROPERTIES or supplied in the HiveQL SELECT statement.
When Basic Authentication is used, the user name and password for the Oracle schema are specified in the Hive external table properties.
Note:
Oracle does not recommend this authentication process in the production environment because the password is stored unmasked in HCatalog.
6.5 Use HiveQL with OD4H
HiveQL is a SQL-like language provided by Hive. It can be used to query Hive external tables created using OD4H.
To track the status of a running query on Oracle Big Data Appliance, you can open the Resource Manager web interface in your browser at the following URL:
http://<domain>:8088/<cluster_name>
You can also see the logs of a query in Cloudera Manager, which also indicates the actual query sent to Oracle Database corresponding to your query on HiveQL. Hive and OD4H use the Simple Logging Facade for Java (SLF4J) framework for logging. You can use the logging configuration techniques of Hive for controlling the logging level of classes related to OD4H.
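For example, the following HiveQL query (a sketch, assuming the EmployeeDataSimple table defined earlier) can be run from the Hive CLI or Beeline like any other Hive query:
SELECT Last_Name, Job_Title, Salary
FROM EmployeeDataSimple
WHERE Salary > 70000
ORDER BY Salary DESC
LIMIT 10;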
6.6 Use Spark SQL with OD4H
Spark SQL enables you to execute relational queries expressed in SQL and HiveSQL using Spark. Spark SQL mixes SQL queries with the programmatic data manipulations supported by RDDs (Resilient Distributed Datasets) in Java, Python, and Scala, within a single application. You can also use it to query external tables created using OD4H.
Before running the queries, perform the following steps to configure Spark SQL on Oracle Big Data Appliance:
- Add ojdbc7.jar and osh.jar to the CLASSPATH in /usr/lib/spark/bin/compute-classpath.sh:

CLASSPATH="$CLASSPATH:/opt/oracle/od4h/lib/osh.jar"
CLASSPATH="$CLASSPATH:/opt/oracle/od4h/lib/ojdbc7.jar"

- Edit SPARK_HOME in /usr/lib/spark/conf/spark-env.sh:

export SPARK_HOME=/usr/lib/spark:/etc/hive/conf

- Specify the additional environment variables in /usr/lib/spark/conf/spark-env.sh. The Hive-related variables that you need to set for running the code example are shown below; the file already contains the Hadoop-related environment variables.

export DEFAULT_HADOOP=/usr/lib/hadoop
export DEFAULT_HIVE=/usr/lib/hive
export DEFAULT_HADOOP_CONF=/etc/hadoop/conf
export DEFAULT_HIVE_CONF=/etc/hive/conf
export HADOOP_HOME=${HADOOP_HOME:-$DEFAULT_HADOOP}
export HADOOP_HDFS_HOME=${HADOOP_HDFS_HOME:-${HADOOP_HOME}/../hadoop-hdfs}
export HADOOP_MAPRED_HOME=${HADOOP_MAPRED_HOME:-${HADOOP_HOME}/../hadoop-mapreduce}
export HADOOP_YARN_HOME=${HADOOP_YARN_HOME:-${HADOOP_HOME}/../hadoop-yarn}
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-$DEFAULT_HADOOP_CONF}
export HIVE_CONF_DIR=${HIVE_CONF_DIR:-$DEFAULT_HIVE_CONF}

CLASSPATH="$CLASSPATH:$HIVE_CONF_DIR"
CLASSPATH="$CLASSPATH:$HADOOP_CONF_DIR"

if [ "x" != "x$YARN_CONF_DIR" ]; then
  CLASSPATH="$CLASSPATH:$YARN_CONF_DIR"
fi

# Let's make sure that all needed hadoop libs are added properly
CLASSPATH="$CLASSPATH:$HADOOP_HOME/client/*"
CLASSPATH="$CLASSPATH:$HIVE_HOME/lib/*"
CLASSPATH="$CLASSPATH:$($HADOOP_HOME/bin/hadoop classpath)"
Once configured, you can run some sample queries on Spark SQL using the scripts included in the /shell/*QuerySpark.sh demo files. By default, Spark prints the queries on the console. To modify this behavior, you can edit the Spark logging configuration file /usr/lib/spark/conf/log4j.properties.
The log printed by OracleRecordReader shows the actual query sent to Oracle Database, as follows:
15/03/18 10:36:08 INFO OracleRecordReader: Reading records from Oracle Table using Query: SELECT FIRST_NAME, LAST_NAME, EMP_ID FROM EmployeeData
6.7 Writing Back to Oracle Database
In the typical use case for OD4H, you store the result sets of Hive or Spark SQL queries back to Oracle Database. OD4H implements OutputFormat to enable you to write back to an Oracle Database table from Hadoop.
After the data is inserted into an Oracle Database table, you can then use your favorite business intelligence tools for further data mining.
The following query is from the OD4H demo code samples. It demonstrates writing back to an external table called EmployeeBonusReport.
Example 6-1 Writing Hive or Spark Result Sets Back to Oracle Database
INSERT INTO EmployeeBonusReport
SELECT EmployeeDataSimple.First_Name, EmployeeDataSimple.Last_Name,
EmployeeBonus.bonus
FROM EmployeeDataSimple JOIN EmployeeBonus ON
(EmployeeDataSimple.Emp_ID=EmployeeBonus.Emp_ID)
WHERE salary > 70000 and bonus > 7000

