This chapter describes how to use Oracle Big Data SQL to create external tables and access data from Hadoop data sources as well as Oracle NoSQL Database.
It also describes some of the changes that Oracle Big Data SQL makes on the Oracle Database server.
Oracle Big Data SQL enables you to query Hive tables from the Oracle Database using the full power of Oracle SQL SELECT statements. It also enables you to write queries that join Oracle tables and Hive data, leverage robust Oracle Database security features, and take advantage of advanced SQL capabilities like analytic functions, JSON handling, and others.
To enable Oracle Big Data SQL to query Hive data, you must first define an Oracle external table for your Hive data. There are a number of tools available to help you create the Oracle external table definition.
DBMS_HADOOP
DBMS_HADOOP is a PL/SQL package that contains the CREATE_EXTDDL_FOR_HIVE procedure. This procedure generates the DDL to create an Oracle external table for a given Hive table. You can optionally edit the text of the generated DDL before execution in order to customize the external table properties.
The Big Data SQL wizard in Oracle SQL Developer
The most recent versions of the free Oracle SQL Developer tool include a Big Data SQL wizard that guides you easily through the process of creating Oracle external table definitions.
If you have a configured Hive connection in Oracle SQL Developer, then in the Connections navigator, drill down from the connection entry point to a Hive table and do the following:
Right-click on the table icon and select Use in Oracle Big Data SQL...
When prompted, select an Oracle Database connection for the import of the Hive table.
Select an Oracle Big Data SQL-enabled target database.
In the Create Table dialog, check over the current configuration for columns, external table properties, and storage. Modify as needed. You can also preview the text of the DDL that will be generated.
Click OK when you are satisfied with the table definition. The wizard will create the external table at the designated location.
The Oracle SQL Developer Data Modeler
This is a free graphical design tool that you can use to connect to a Hive metastore and generate an external table. You can select and import one or multiple Hive tables, modify table properties as needed, and then generate the DDL that you can copy into a SQL Worksheet and run in order to create an Oracle external table. Although the Data Modeler is a more complex tool to use than the other options, its advantage is that you can use it to work on multiple Hive tables.
See Oracle SQL Developer & Data Modeler Support for Oracle Big Data SQL in the Oracle Blog space for a demonstration of how to use the Data Modeler.
See Also:
For instructions on how to install Oracle SQL Developer and connect to Hive in order to create external tables, see Using Oracle SQL Developer to Connect to Hive.

The DBMS_HADOOP PL/SQL package contains a function named CREATE_EXTDDL_FOR_HIVE. It returns the data definition language (DDL) to create an external table for accessing a Hive table. This function requires you to provide basic information about the Hive table:
Name of the Hadoop cluster
Name of the Hive database
Name of the Hive table
Whether the Hive table is partitioned
You can obtain this information by querying the ALL_HIVE_TABLES data dictionary view, which displays information about all Hive tables that you can access from Oracle Database.

This example shows that the current user has access to an unpartitioned Hive table named RATINGS_HIVE_TABLE in the default database. A user named JDOE is the owner.
SQL> SELECT cluster_id, database_name, owner, table_name, partitioned
     FROM all_hive_tables;

CLUSTER_ID   DATABASE_NAME  OWNER    TABLE_NAME          PARTITIONED
------------ -------------- -------- ------------------  --------------
hadoop1      default        jdoe     ratings_hive_table  UN-PARTITIONED
With the information from the data dictionary, you can use the CREATE_EXTDDL_FOR_HIVE function of DBMS_HADOOP. This example specifies a database table name of RATINGS_DB_TABLE in the current schema. The function returns the text of the CREATE TABLE command in a local variable named DDLout, but does not execute it.
DECLARE
   DDLout VARCHAR2(4000);
BEGIN
   dbms_hadoop.create_extddl_for_hive(
      CLUSTER_ID=>'hadoop1',
      DB_NAME=>'default',
      HIVE_TABLE_NAME=>'ratings_hive_table',
      HIVE_PARTITION=>FALSE,
      TABLE_NAME=>'ratings_db_table',
      PERFORM_DDL=>FALSE,
      TEXT_OF_DDL=>DDLout
   );
   dbms_output.put_line(DDLout);
END;
/
When this procedure runs, the PUT_LINE function displays the CREATE TABLE command:
CREATE TABLE ratings_db_table (
   c0 VARCHAR2(4000),
   c1 VARCHAR2(4000),
   c2 VARCHAR2(4000),
   c3 VARCHAR2(4000),
   c4 VARCHAR2(4000),
   c5 VARCHAR2(4000),
   c6 VARCHAR2(4000),
   c7 VARCHAR2(4000))
ORGANIZATION EXTERNAL
   (TYPE ORACLE_HIVE
    DEFAULT DIRECTORY DEFAULT_DIR
    ACCESS PARAMETERS
       (
        com.oracle.bigdata.cluster=hadoop1
        com.oracle.bigdata.tablename=default.ratings_hive_table
       )
   )
PARALLEL 2 REJECT LIMIT UNLIMITED
You can capture this DDL in a SQL script, and edit the Oracle table name, the column names, and the data types as desired before executing it. You might also add access parameters, for example to specify a date format mask.
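As a sketch of such an edit, the generated DDL above could be rewritten as follows before execution. The column names and types here are hypothetical, chosen only to illustrate renaming the default C0, C1, C2 columns; the com.oracle.bigdata.colmap parameters map the renamed Oracle columns back to the original Hive fields:

```sql
-- Hypothetical edit of the generated DDL: user_id, movie_id, and rating
-- are illustrative names; colmap maps them back to Hive fields c0, c1, c2.
CREATE TABLE ratings_db_table (
   user_id  VARCHAR2(20),
   movie_id VARCHAR2(20),
   rating   NUMBER)
ORGANIZATION EXTERNAL
   (TYPE ORACLE_HIVE
    DEFAULT DIRECTORY DEFAULT_DIR
    ACCESS PARAMETERS
       (
        com.oracle.bigdata.cluster=hadoop1
        com.oracle.bigdata.tablename=default.ratings_hive_table
        com.oracle.bigdata.colmap=[{"col":"USER_ID","field":"c0"}, \
                                   {"col":"MOVIE_ID","field":"c1"}, \
                                   {"col":"RATING","field":"c2"}]
       )
   )
PARALLEL 2 REJECT LIMIT UNLIMITED;
```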
The ALL_HIVE_COLUMNS view shows how the default column names and data types are derived. This example shows that the Hive column names are C0 to C7, and that the Hive STRING data type maps to VARCHAR2(4000):
SQL> SELECT table_name, column_name, hive_column_type, oracle_column_type
     FROM all_hive_columns;

TABLE_NAME            COLUMN_NAME  HIVE_COLUMN_TYPE ORACLE_COLUMN_TYPE
--------------------- ------------ ---------------- ------------------
ratings_hive_table    c0           string           VARCHAR2(4000)
ratings_hive_table    c1           string           VARCHAR2(4000)
ratings_hive_table    c2           string           VARCHAR2(4000)
ratings_hive_table    c3           string           VARCHAR2(4000)
ratings_hive_table    c4           string           VARCHAR2(4000)
ratings_hive_table    c5           string           VARCHAR2(4000)
ratings_hive_table    c6           string           VARCHAR2(4000)
ratings_hive_table    c7           string           VARCHAR2(4000)

8 rows selected.
Oracle SQL Developer provides methods to connect to a Hive metastore and create Oracle external tables over Hive.
Follow these steps to set up Oracle SQL Developer to work with Oracle Big Data SQL.
Install Oracle SQL Developer
Download the Hive JDBC Drivers
Add the new Hive JDBC Drivers to Oracle SQL Developer
Create a database connection to Hive.
Installing Oracle SQL Developer
Install Oracle SQL Developer 4.2 or greater. Release 4.2 is recommended because it includes support for Copy To Hadoop, a useful Oracle Big Data SQL tool for off-loading Oracle Database tables to HDFS.
The installation is simple. Just download the package and extract it.
Go to the Oracle SQL Developer download site on the Oracle Technology Network (OTN).
Accept the license agreement and download the version that is appropriate for your platform.
For most users, Windows 64-bit with JDK 8 included is the correct choice.
Extract the downloaded ZIP file to your local drive.
You can extract to any folder name.
See Installing and Getting Started with SQL Developer in the Oracle SQL Developer User’s Guide for further installation and configuration details.
Downloading and Installing the Hive JDBC Drivers for Cloudera Enterprise
To connect Oracle SQL Developer to Hive in the Hadoop environment, you need to download and install the Hive JDBC drivers for Cloudera Enterprise. These drivers are not included in the Oracle SQL Developer download package.
Note for HDP Users:
At this time, SQL Developer 4.2 requires the Cloudera JDBC drivers for Hive. However, these drivers appear to work against Hortonworks clusters as well. HDP users should test to determine if these drivers meet their needs.

Download the latest Cloudera JDBC drivers for Hive from the Cloudera website to any local directory.
You can search for “cloudera hive jdbc drivers download” on the Cloudera website to locate the available driver packages.
You are prompted to select the driver version, OS, and OS version (32/64 bit). At this time, the latest driver version is 2.5.18. You can choose the newest version available.
Unzip the archive:
unzip hive_jdbc_<version>.zip
View the extracted content. Notice that under the top-level folder there are multiple ZIP files. Each is for a different JDBC version. For this setup, only JDBC 4.0 is usable. Select the JDBC4_ ZIP file (JDBC4_<version>.zip).
Important:
Choose only the JDBC4_ ZIP file, which contains the drivers for JDBC 4.0. This is the only compatible version. The drivers in other packages, such as JDBC41_*, are not compatible with SQL Developer 4.2 and will return errors upon connection attempts.

Unzip the JDBC4 archive to a target directory that is accessible to Oracle SQL Developer, for example, /home/oracle/jdbc:
# unzip Cloudera_HiveJDBC4_<version>.zip -d /home/oracle/jdbc/
The extracted content should be similar to this:
Cloudera_HiveJDBC4_2.5.18.1050\Cloudera-JDBC-Driver-for-Apache-Hive-Install-Guide.pdf
Cloudera_HiveJDBC4_2.5.18.1050\Cloudera-JDBC-Driver-for-Apache-Hive-Release-Notes.pdf
Cloudera_HiveJDBC4_2.5.18.1050\commons-codec-1.3.jar
Cloudera_HiveJDBC4_2.5.18.1050\commons-logging-1.1.1.jar
Cloudera_HiveJDBC4_2.5.18.1050\HiveJDBC4.jar
Cloudera_HiveJDBC4_2.5.18.1050\hive_metastore.jar
Cloudera_HiveJDBC4_2.5.18.1050\hive_service.jar
Cloudera_HiveJDBC4_2.5.18.1050\httpclient-4.1.3.jar
Cloudera_HiveJDBC4_2.5.18.1050\httpcore-4.1.3.jar
Cloudera_HiveJDBC4_2.5.18.1050\libfb303-0.9.0.jar
Cloudera_HiveJDBC4_2.5.18.1050\libthrift-0.9.0.jar
Cloudera_HiveJDBC4_2.5.18.1050\log4j-1.2.14.jar
Cloudera_HiveJDBC4_2.5.18.1050\out.txt
Cloudera_HiveJDBC4_2.5.18.1050\ql.jar
Cloudera_HiveJDBC4_2.5.18.1050\slf4j-api-1.5.11.jar
Cloudera_HiveJDBC4_2.5.18.1050\slf4j-log4j12-1.5.11.jar
Cloudera_HiveJDBC4_2.5.18.1050\TCLIServiceClient.jar
Cloudera_HiveJDBC4_2.5.18.1050\zookeeper-3.4.6.jar
Add the new Hive JDBC Drivers to Oracle SQL Developer
Next, start up SQL Developer and copy all of the extracted driver files into “Third Party JDBC Drivers” in the Preferences window.
Navigate to the folder where you downloaded and extracted Oracle SQL Developer.
Click the sqldeveloper subfolder. Then, click sqldeveloper.exe in this folder.
In the SQL Developer menu bar, select Tools>Preferences.
In the file explorer of the Preferences window, expand Database and then click Third Party JDBC Drivers.
Click Add Entry.
Navigate to the folder where you extracted the files from Cloudera_HiveJDBC4_<version>.zip. Copy all of the JAR files from the ZIP extraction into this window and then click OK.
Restart Oracle SQL Developer.
Create a Database Connection to Hive
After the drivers are installed, you can create a connection to Hiveserver2.
If you are creating a Kerberos-secured connection, you will need a user ID, the Kerberos connection parameters, and the number of the port where Hiveserver2 is running on the Hadoop system (typically, port 10000). A keytab must exist for the user.
If you are not using Kerberos, you will need a user ID (the oracle user or a user with equivalent privileges), the account password, and the Hiveserver2 port number.
See Create/Edit/Select Database Connection in the Oracle SQL Developer User’s Guide for an explanation of the fields in the Oracle and Hive tabs in the New/Select Database Connection dialog.
Whichever method you use to create an Oracle external table over Hive (DBMS_HADOOP, Oracle SQL Developer Data Modeler, the Big Data SQL wizard in Oracle SQL Developer, or manual coding), you may need to set some access parameters to modify the default behavior of ORACLE_HIVE.
The following statement creates an external table named ORDER to access Hive data:
CREATE TABLE order (cust_num    VARCHAR2(10),
                    order_num   VARCHAR2(20),
                    description VARCHAR2(100),
                    order_total NUMBER (8,2))
   ORGANIZATION EXTERNAL (TYPE oracle_hive);
Because no access parameters are set in the statement, the ORACLE_HIVE access driver uses the default settings to do the following:
Connects to the default Hadoop cluster.

Uses a Hive table named order. An error results if the Hive table does not have fields named CUST_NUM, ORDER_NUM, DESCRIPTION, and ORDER_TOTAL.

Sets the value of a field to NULL if there is a conversion error, such as a CUST_NUM value longer than 10 bytes.
You can set properties in the ACCESS PARAMETERS clause of the external table clause to override the default behavior of the access driver. The following clause includes the com.oracle.bigdata.overflow access parameter. When this clause is used in the previous example, data for the DESCRIPTION column that is longer than 100 characters is truncated instead of causing an error:
(TYPE oracle_hive
 ACCESS PARAMETERS (
    com.oracle.bigdata.overflow={"action":"truncate", "col":"DESCRIPTION"}
 ))
The next example sets most of the available parameters for ORACLE_HIVE:
CREATE TABLE order (cust_num    VARCHAR2(10),
                    order_num   VARCHAR2(20),
                    order_date  DATE,
                    item_cnt    NUMBER,
                    description VARCHAR2(100),
                    order_total NUMBER(8,2))
   ORGANIZATION EXTERNAL (
      TYPE oracle_hive
      ACCESS PARAMETERS (
         com.oracle.bigdata.tablename:  order_db.order_summary
         com.oracle.bigdata.colmap:     {"col":"ITEM_CNT", \
                                         "field":"order_line_item_count"}
         com.oracle.bigdata.overflow:   {"action":"TRUNCATE", \
                                         "col":"DESCRIPTION"}
         com.oracle.bigdata.erroropt:   [{"action":"replace", \
                                          "value":"INVALID_NUM", \
                                          "col":["CUST_NUM","ORDER_NUM"]}, \
                                         {"action":"reject", \
                                          "col":"ORDER_TOTAL"}]
      )
   );
The parameters make the following changes in the way that the ORACLE_HIVE access driver locates the data and handles error conditions:

com.oracle.bigdata.tablename: Handles differences in table names. ORACLE_HIVE looks for a Hive table named ORDER_SUMMARY in the ORDER_DB database.

com.oracle.bigdata.colmap: Handles differences in column names. The Hive ORDER_LINE_ITEM_COUNT field maps to the Oracle ITEM_CNT column.

com.oracle.bigdata.overflow: Truncates string data. Values longer than 100 characters for the DESCRIPTION column are truncated.

com.oracle.bigdata.erroropt: Replaces bad data. Errors in the data for CUST_NUM or ORDER_NUM set the value to INVALID_NUM.
You can use the ORACLE_HIVE access driver to access data stored in Oracle NoSQL Database. However, you must first create a Hive external table that accesses the KVStore. Then you can create an external table in Oracle Database over it, similar to the process described in "Creating an Oracle External Table for Hive Data".
This section contains the following topics:
To provide access to the data in Oracle NoSQL Database, you create a Hive external table over the Oracle NoSQL table. Oracle Big Data SQL provides a StorageHandler named oracle.kv.hadoop.hive.table.TableStorageHandler that enables Hive to read the Oracle NoSQL Database table format.
The following is the basic syntax of a Hive CREATE TABLE statement for a Hive external table over an Oracle NoSQL table:
CREATE EXTERNAL TABLE tablename (colname coltype[, colname coltype,...])
   STORED BY 'oracle.kv.hadoop.hive.table.TableStorageHandler'
   TBLPROPERTIES (
      "oracle.kv.kvstore" = "database",
      "oracle.kv.hosts" = "nosql_node1:port[, nosql_node2:port...]",
      "oracle.kv.hadoop.hosts" = "hadoop_node1[,hadoop_node2...]",
      "oracle.kv.tableName" = "table_name");
Hive CREATE TABLE Parameters

tablename
The name of the Hive external table being created.
This table name will be used in SQL queries issued in Oracle Database, so choose a name that is appropriate for users. The name of the external table that you create in Oracle Database must be identical to the name of this Hive table.
Table, column, and field names are case insensitive in Oracle NoSQL Database, Apache Hive, and Oracle Database.

colname coltype
The names and data types of the columns in the Hive external table. See Table 2-1 for the data type mappings between Oracle NoSQL Database and Hive.

Hive CREATE TABLE TBLPROPERTIES Clause

oracle.kv.kvstore
The name of the KVStore. Only upper- and lowercase letters and digits are valid in the name.

oracle.kv.hosts
A comma-delimited list of host names and port numbers in the Oracle NoSQL Database cluster. Each string has the format hostname:port. Enter multiple names to provide redundancy in the event that a host fails.

oracle.kv.hadoop.hosts
A comma-delimited list of all host names in the Hadoop cluster with Oracle Big Data SQL enabled.

oracle.kv.tableName
The name of the table in Oracle NoSQL Database that stores the data for this Hive external table.
See Also:
Apache Hive Language Manual DDL
Use the following syntax to create an external table in Oracle Database that can access the Oracle NoSQL data through a Hive external table:
CREATE TABLE tablename(colname colType[, colname colType...]) ORGANIZATION EXTERNAL (TYPE ORACLE_HIVE DEFAULT DIRECTORY directory ACCESS PARAMETERS (access parameters) ) REJECT LIMIT UNLIMITED;
In this syntax, you identify the column names and data types. For more about this syntax, see "About the SQL CREATE TABLE Statement".
When Oracle Big Data SQL retrieves data from Oracle NoSQL Database, the data is converted twice: first to a Hive data type when it is read into the columns of the Hive external table, and then to an Oracle data type when it is read into the columns of the Oracle Database external table.
To execute a Big Data SQL query against data stored in an Oracle NoSQL Database table, a Hive external table must first be created with a schema mapped from the schema of the desired Oracle NoSQL Database table. Table 2-1 identifies the supported data types of the Oracle NoSQL Database table API and their mappings to Hive.
Table 2-1 Mapping Hive Data Types to the NoSQL Database Table API Data Model
Oracle NoSQL Database Table API | Hive |
---|---|
FieldDef.Type.STRING | STRING |
FieldDef.Type.BOOLEAN | BOOLEAN |
FieldDef.Type.BINARY | BINARY |
FieldDef.Type.FIXED_BINARY | BINARY |
FieldDef.Type.INTEGER | INT |
FieldDef.Type.LONG | BIGINT |
FieldDef.Type.FLOAT | FLOAT |
FieldDef.Type.DOUBLE | DOUBLE |
FieldDef.Type.ENUM | STRING |
FieldDef.Type.ARRAY | ARRAY |
FieldDef.Type.MAP | MAP<STRING, data_type> |
FieldDef.Type.RECORD | STRUCT<col_name : data_type, ...> |
Note:
To complete this mapping, a corresponding Oracle Database external table must be created with a schema mapped from the schema of the Hive table.

Also note that the following Hive data types are not applicable to the mapping of Oracle NoSQL data types to Oracle Database data types: VARCHAR, CHAR, TINYINT, SMALLINT, DECIMAL, TIMESTAMP, DATE, UNION TYPE.
See Also:
About Data Type Conversions provides details on Hive to Oracle Database data type mappings. Predicate pushdown in Oracle Big Data SQL requires that certain mappings between Hive data types and Oracle data types be present. See About Predicate Push Down.
This example uses the sample data provided with the Oracle NoSQL Database software:
Verify that the following files reside in the examples/hadoop/table directory:
create_vehicle_table.kvs CountTableRows.java LoadVehicleTable.java
This example runs on a Hadoop cluster node named some1node07 and uses a KVStore named SOME1KV.
To create and populate the sample table in Oracle NoSQL Database:
Open a connection to an Oracle NoSQL Database node on your Hadoop cluster.
Create a table named vehicleTable. The following example uses the load command to run the commands in create_vehicle_table.kvs:
$ cd NOSQL_HOME
$ java -jar lib/kvcli.jar -host some1node07 -port 5000 \
load -file examples/hadoop/table/create_vehicle_table.kvs
Compile LoadVehicleTable.java:
$ javac -cp examples:lib/kvclient.jar examples/hadoop/table/LoadVehicleTable.java
Execute the LoadVehicleTable class to populate the table:
$ java -cp examples:lib/kvclient.jar hadoop.table.LoadVehicleTable -host some1node07 -port 5000 -store SOME1KV
{"type":"auto","make":"Chrysler","model":"PTCruiser","class":"4WheelDrive","colo
r":"white","price":20743.240234375,"count":30}
{"type":"suv","make":"Ford","model":"Escape","class":"FrontWheelDrive","color":"
.
.
.
10 new records added
The vehicleTable table contains the following fields:
Table 2-2 Fields in the vehicleTable Example
Field Name | Data Type |
---|---|
type | STRING |
make | STRING |
model | STRING |
class | STRING |
color | STRING |
price | DOUBLE |
count | INTEGER |
The following example creates a Hive table named VEHICLES that accesses vehicleTable in the SOME1KV KVStore. In this example, the system is configured with a Hadoop cluster in the first six servers (some1node01 to some1node06) and an Oracle NoSQL Database cluster in the next three servers (some1node07 to some1node09).
CREATE EXTERNAL TABLE IF NOT EXISTS vehicles
   (type  STRING,
    make  STRING,
    model STRING,
    class STRING,
    color STRING,
    price DOUBLE,
    count INT)
COMMENT 'Accesses data in vehicleTable in the SOME1KV KVStore'
STORED BY 'oracle.kv.hadoop.hive.table.TableStorageHandler'
TBLPROPERTIES
   ("oracle.kv.kvstore" = "SOME1KV",
    "oracle.kv.hosts" = "some1node07.example.com:5000,some1node08.example.com:5000",
    "oracle.kv.hadoop.hosts" = "some1node01.example.com,some1node02.example.com,some1node03.example.com,some1node04.example.com,some1node05.example.com,some1node06.example.com",
    "oracle.kv.tableName" = "vehicleTable");
The DESCRIBE command lists the columns in the VEHICLES table:
hive> DESCRIBE vehicles;
OK
type string from deserializer
make string from deserializer
model string from deserializer
class string from deserializer
color string from deserializer
price double from deserializer
count int from deserializer
A query against the Hive VEHICLES table returns data from the Oracle NoSQL vehicleTable table:
hive> SELECT make, model, class
      FROM vehicletable
      WHERE type='truck' AND color='red'
      ORDER BY make, model;

Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
.
.
.
Chrysler     Ram1500        RearWheelDrive
Chrysler     Ram2500        FrontWheelDrive
Ford         F150           FrontWheelDrive
Ford         F250           RearWheelDrive
Ford         F250           AllWheelDrive
Ford         F350           RearWheelDrive
GM           Sierra         AllWheelDrive
GM           Silverado1500  RearWheelDrive
GM           Silverado1500  AllWheelDrive
After you create the Hive table, the metadata is available in the Oracle Database static data dictionary views. The following SQL SELECT statement returns information about the Hive table created in the previous topic:
SQL> SELECT table_name, column_name, hive_column_type
     FROM all_hive_columns
     WHERE table_name='vehicles';

TABLE_NAME      COLUMN_NAME  HIVE_COLUMN_TYPE
--------------- ------------ ----------------
vehicles        type         string
vehicles        make         string
vehicles        model        string
vehicles        class        string
vehicles        color        string
vehicles        price        double
vehicles        count        int
The next SQL CREATE TABLE statement generates an external table named VEHICLES over the Hive VEHICLES table, using the ORACLE_HIVE access driver. The name of the table in Oracle Database must be identical to the name of the table in Hive. However, table names are case insensitive in both Oracle NoSQL Database and Oracle Database.
CREATE TABLE vehicles
   (type  VARCHAR2(10),
    make  VARCHAR2(12),
    model VARCHAR2(20),
    class VARCHAR2(40),
    color VARCHAR2(20),
    price NUMBER(8,2),
    count NUMBER)
ORGANIZATION EXTERNAL
   (TYPE ORACLE_HIVE
    DEFAULT DIRECTORY DEFAULT_DIR
    ACCESS PARAMETERS
       (com.oracle.bigdata.debug=true
        com.oracle.bigdata.log.opt=normal))
REJECT LIMIT UNLIMITED;
This SQL SELECT statement retrieves all rows for red trucks from vehicleTable in Oracle NoSQL Database:
SQL> SELECT make, model, class
     FROM vehicles
     WHERE type='truck' AND color='red'
     ORDER BY make, model;

MAKE         MODEL                CLASS
------------ -------------------- ---------------------
Chrysler     Ram1500              RearWheelDrive
Chrysler     Ram2500              FrontWheelDrive
Ford         F150                 FrontWheelDrive
Ford         F250                 AllWheelDrive
Ford         F250                 RearWheelDrive
Ford         F350                 RearWheelDrive
GM           Sierra               AllWheelDrive
GM           Silverado1500        RearWheelDrive
GM           Silverado1500        4WheelDrive
GM           Silverado1500        AllWheelDrive
You can also use the ORACLE_HIVE access driver to access data stored in Apache HBase. However, you must first create a Hive external table that accesses the HBase table. Then you can create an external table in Oracle Database over it. The basic steps are the same as those described in "Creating an Oracle External Table for Oracle NoSQL Database".
To provide access to the data in an HBase table, you create a Hive external table over it. Apache provides a storage handler and a SerDe that enable Hive to read the HBase table format.
The following is the basic syntax of a Hive CREATE TABLE statement for an external table over an HBase table:
CREATE EXTERNAL TABLE tablename (colname coltype[, colname coltype,...])
   ROW FORMAT SERDE 'org.apache.hadoop.hive.hbase.HBaseSerDe'
   STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
   WITH SERDEPROPERTIES (
      'serialization.format'='1',
      'hbase.columns.mapping'=':key,value:key,value:
See Also:
Apache Hive Language Manual DDL
Hive HBase Integration at
https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration#HBaseIntegration-StorageHandlers
Class HBaseSerDe
in the Hive API reference at
http://hive.apache.org/javadocs/r0.13.1/api/hbase-handler/index.html
Use the following syntax to create an external table in Oracle Database that can access the HBase data through a Hive external table:
CREATE TABLE tablename(colname colType[, colname colType...])
ORGANIZATION EXTERNAL
(TYPE ORACLE_HIVE DEFAULT DIRECTORY DEFAULT_DIR
ACCESS PARAMETERS
(access parameters)
)
REJECT LIMIT UNLIMITED;
In this syntax, you identify the column names and data types. To specify the access parameters, see "About the SQL CREATE TABLE Statement".
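Continuing the pattern of the earlier ORACLE_HIVE examples, a concrete Oracle external table over a Hive-on-HBase table might look like the following sketch. The table name hbase_events and its columns are hypothetical, chosen only for illustration:

```sql
-- Hypothetical example: hbase_events and its columns are illustrative names.
-- The Hive external table default.hbase_events is assumed to already exist
-- over the HBase table, as described above.
CREATE TABLE hbase_events (event_key  VARCHAR2(30),
                           event_desc VARCHAR2(100))
   ORGANIZATION EXTERNAL
      (TYPE ORACLE_HIVE
       DEFAULT DIRECTORY DEFAULT_DIR
       ACCESS PARAMETERS (com.oracle.bigdata.tablename=default.hbase_events))
   REJECT LIMIT UNLIMITED;
```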
The ORACLE_HDFS access driver enables you to access many types of data that are stored in HDFS, but which do not have Hive metadata. You can define the record format of text data, or you can specify a SerDe for a particular data format.

You must create the external table for HDFS files manually, and provide all the information the access driver needs to locate the data and parse the records and fields. The following are some examples of CREATE TABLE ORGANIZATION EXTERNAL statements.
The following statement creates a table named ORDER to access the data in all files stored in the /usr/cust/summary directory in HDFS:
CREATE TABLE ORDER (cust_num    VARCHAR2(10),
                    order_num   VARCHAR2(20),
                    order_total NUMBER (8,2))
   ORGANIZATION EXTERNAL
      (TYPE oracle_hdfs
       DEFAULT DIRECTORY DEFAULT_DIR
       LOCATION ('hdfs:/usr/cust/summary/*'));
Because no access parameters are set in the statement, the ORACLE_HDFS access driver uses the default settings to do the following:

Connects to the default Hadoop cluster.

Reads the files as delimited text, and the fields as type STRING.

Assumes that the number of fields in the HDFS files matches the number of columns (three in this example).

Assumes the fields are in the same order as the columns, so that CUST_NUM data is in the first field, ORDER_NUM data is in the second field, and ORDER_TOTAL data is in the third field.

Rejects any records in which a value causes a data conversion error: if the value for CUST_NUM exceeds 10 characters, the value for ORDER_NUM exceeds 20 characters, or the value of ORDER_TOTAL cannot be converted to NUMBER.
You can use many of the same access parameters with ORACLE_HDFS as with ORACLE_HIVE.

The following example is equivalent to the one shown in "Overriding the Default ORACLE_HIVE Settings". The external table accesses a delimited text file stored in HDFS.
CREATE TABLE order (cust_num    VARCHAR2(10),
                    order_num   VARCHAR2(20),
                    order_date  DATE,
                    item_cnt    NUMBER,
                    description VARCHAR2(100),
                    order_total NUMBER(8,2))
   ORGANIZATION EXTERNAL
      (TYPE oracle_hdfs
       DEFAULT DIRECTORY DEFAULT_DIR
       ACCESS PARAMETERS (
          com.oracle.bigdata.colmap:   {"col":"item_cnt", "field":"order_line_item_count"}
          com.oracle.bigdata.overflow: {"action":"TRUNCATE", "col":"DESCRIPTION"}
          com.oracle.bigdata.erroropt: [{"action":"replace", \
                                         "value":"INVALID NUM", \
                                         "col":["CUST_NUM","ORDER_NUM"]}, \
                                        {"action":"reject", "col":"ORDER_TOTAL"}]
       )
       LOCATION ('hdfs:/usr/cust/summary/*'));
The parameters make the following changes in the way that the ORACLE_HDFS access driver locates the data and handles error conditions:

com.oracle.bigdata.colmap: Handles differences in column names. ORDER_LINE_ITEM_COUNT in the HDFS files matches the ITEM_CNT column in the external table.

com.oracle.bigdata.overflow: Truncates string data. Values longer than 100 characters for the DESCRIPTION column are truncated.

com.oracle.bigdata.erroropt: Replaces bad data. Errors in the data for CUST_NUM or ORDER_NUM set the value to INVALID_NUM.
The next example uses a SerDe to access Avro container files.
CREATE TABLE order (cust_num    VARCHAR2(10),
                    order_num   VARCHAR2(20),
                    order_date  DATE,
                    item_cnt    NUMBER,
                    description VARCHAR2(100),
                    order_total NUMBER(8,2))
   ORGANIZATION EXTERNAL
      (TYPE oracle_hdfs
       DEFAULT DIRECTORY DEFAULT_DIR
       ACCESS PARAMETERS (
          com.oracle.bigdata.rowformat: \
             SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
          com.oracle.bigdata.fileformat: \
             INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' \
             OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
          com.oracle.bigdata.colmap:   {"col":"item_cnt", \
                                        "field":"order_line_item_count"}
          com.oracle.bigdata.overflow: {"action":"TRUNCATE", \
                                        "col":"DESCRIPTION"}
       )
       LOCATION ('hdfs:/usr/cust/summary/*'));
The access parameters provide the following information to the ORACLE_HDFS access driver:

com.oracle.bigdata.rowformat: Identifies the SerDe that the access driver needs to use to parse the records and fields. The files are not in delimited text format.

com.oracle.bigdata.fileformat: Identifies the Java classes that can extract records and output them in the desired format.

com.oracle.bigdata.colmap: Handles differences in column names. ORACLE_HDFS matches ORDER_LINE_ITEM_COUNT in the HDFS files with the ITEM_CNT column in the external table.

com.oracle.bigdata.overflow: Truncates string data. Values longer than 100 characters for the DESCRIPTION column are truncated.
The SQL CREATE TABLE statement has a clause specifically for creating external tables. The information that you provide in this clause enables the access driver to read data from an external source and prepare the data for the external table.

The following is the basic syntax of the CREATE TABLE statement for external tables:
CREATE TABLE table_name (column_name datatype, column_name datatype[,...])
   ORGANIZATION EXTERNAL (external_table_clause);
You specify the column names and data types the same as for any other table. ORGANIZATION EXTERNAL identifies the table as an external table.

The external_table_clause identifies the access driver and provides the information that it needs to load the data. See "About the External Table Clause".
CREATE TABLE ORGANIZATION EXTERNAL takes the external_table_clause as its argument. It has the following subclauses:
See Also:
Oracle Database SQL Language Reference for the external_table_clause
The TYPE clause identifies the access driver. The type of access driver determines how the other parts of the external table definition are interpreted.

Specify one of the following values for Oracle Big Data SQL:

ORACLE_HDFS: Accesses files in an HDFS directory.

ORACLE_HIVE: Accesses a Hive table.
Note:
The ORACLE_DATAPUMP and ORACLE_LOADER access drivers are not associated with Oracle Big Data SQL.
The DEFAULT DIRECTORY clause identifies an Oracle Database directory object. The directory object identifies an operating system directory with files that the external table reads and writes.

ORACLE_HDFS and ORACLE_HIVE use the default directory solely to write log files on the Oracle Database system.
The LOCATION clause for ORACLE_HDFS contains a comma-separated list of file locations. The files must reside in the HDFS file system on the default cluster.

A location can be any of the following:

A fully qualified HDFS directory name, such as /user/hive/warehouse/hive_seed/hive_types. ORACLE_HDFS uses all files in the directory.

A fully qualified HDFS file name, such as /user/hive/warehouse/hive_seed/hive_types/hive_types.csv.

A URL for an HDFS file or a set of files, such as hdfs:/user/hive/warehouse/hive_seed/hive_types/*. It is invalid to use the directory name alone.
The file names can contain any pattern-matching character described in Table 2-3.
Table 2-3 Pattern-Matching Characters
Character | Description
---|---
? | Matches any one character
* | Matches zero or more characters
[abc] | Matches one character in the set {a, b, c}
[a-b] | Matches one character in the range {a...b}. The character a must be less than or equal to b.
[^a] | Matches one character that is not in the character set or range {a}. The carat (^) must immediately follow the left bracket, with no spaces.
\c | Removes any special meaning of c. The backslash is the escape character.
{ab\,cd} | Matches a string from the set {ab, cd}. The escape character (\) removes the meaning of the comma as a path separator.
{ab\,c{de\,fh}} | Matches a string from the set {ab, cde, cfh}. The escape character (\) removes the meaning of the comma as a path separator.
Do not specify the LOCATION
clause for ORACLE_HIVE
; it raises an error. The data is stored in Hive, and the access parameters and the metadata store provide the necessary information.
The REJECT LIMIT clause limits the number of conversion errors permitted during a query of the external table before Oracle Database stops the query and returns an error.
Any processing error that causes a row to be rejected counts against the limit. The reject limit applies individually to each parallel query (PQ) process; it is not the total of all rejected rows for all PQ processes.
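To make the per-process behavior concrete, consider a hypothetical table defined with a numeric limit. With the clause below, a query fails as soon as any single PQ process rejects more than 100 rows, so a query running with eight PQ processes could in principle tolerate up to 800 rejected rows in total. All names and paths are placeholders:

```sql
-- Hypothetical example: the limit of 100 applies per PQ process,
-- not to the sum of rejections across all processes.
CREATE TABLE logs_ext (
  ts   TIMESTAMP,
  msg  VARCHAR2(4000)
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_HDFS
  DEFAULT DIRECTORY DEFAULT_DIR
  LOCATION ('/user/logs/app-*.log')
)
REJECT LIMIT 100;
```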
The ACCESS PARAMETERS
clause provides information that the access driver needs to load the data correctly into the external table. See "CREATE TABLE ACCESS PARAMETERS Clause".
When the access driver loads data into an external table, it verifies that the Hive data can be converted to the data type of the target column. If they are incompatible, then the access driver returns an error. Otherwise, it makes the appropriate data conversion.
Hive typically provides a table abstraction layer over data stored elsewhere, such as in HDFS files. Hive uses a serializer/deserializer (SerDe) to convert the data as needed from its stored format into a Hive data type. The access driver then converts the data from its Hive data type to an Oracle data type. For example, if a Hive table over a text file has a BIGINT
column, then the SerDe converts the data from text to BIGINT
. The access driver then converts the data from BIGINT
(a Hive data type) to NUMBER
(an Oracle data type).
Performance is better when one data type conversion is performed instead of two. The data types for the fields in the HDFS files should therefore indicate the data that is actually stored on disk. For example, JSON is a clear-text format, so all data in a JSON file is text. If the Hive type for a field is DATE, then the SerDe converts the data from string (in the data file) to a Hive date, and the access driver then converts the data from a Hive date to an Oracle date. However, if the Hive type for the field is string, then the SerDe performs no conversion, and the access driver converts the data directly from string to an Oracle date. Queries against the external table are faster in the second example, because the access driver performs the only data conversion.
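To illustrate the faster, single-conversion case, the Hive table below deliberately declares the date field as STRING to match what the text file actually stores. This is a hypothetical HiveQL sketch; all names are placeholders:

```sql
-- Hypothetical Hive DDL: sale_date is declared STRING because the
-- underlying text file stores it as text. The SerDe therefore does
-- no conversion, and only the access driver converts string -> DATE
-- when the Oracle external table is queried.
CREATE TABLE hive_sales (
  sale_id    BIGINT,
  sale_date  STRING,   -- stored as text; avoids a SerDe text->date step
  amount     DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
```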
The table below identifies the data type conversions that ORACLE_HIVE
can make when loading data into an external table.
Table 2-4 Supported Hive to Oracle Data Type Conversions
Hive Data Type | VARCHAR2, CHAR, NCHAR2, NCHAR, CLOB | NUMBER, FLOAT, BINARY_NUMBER, BINARY_FLOAT | BLOB | RAW | DATE, TIMESTAMP, TIMESTAMP WITH TZ, TIMESTAMP WITH LOCAL TZ | INTERVAL YEAR TO MONTH, INTERVAL DAY TO SECOND
---|---|---|---|---|---|---
INT, SMALLINT, TINYINT, BIGINT | yes | yes | yes | yes | no | no
DOUBLE, FLOAT | yes | yes | yes | yes | no | no
DECIMAL | yes | yes | no | no | no | no
BOOLEAN | yes (Footnote 1) | yes (Footnote 2) | yes (Footnote 2) | yes | no | no
BINARY | yes | no | yes | yes | no | no
STRING | yes | yes | no | no | yes | yes
TIMESTAMP | yes | no | no | no | yes | no
STRUCT, ARRAY, UNIONTYPE, MAP | yes | no | no | no | no | no

Footnote 1: FALSE maps to the string FALSE, and TRUE maps to the string TRUE.
Footnote 2: FALSE maps to 0, and TRUE maps to 1.
Users can query external tables using the SQL SELECT
statement, the same as they query any other table.
Users who query the data on a Hadoop cluster must have READ
access in Oracle Database to the external table and to the database directory object that points to the cluster directory. See "About the Cluster Directory".
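A sketch of the required grants, with hypothetical user, schema, table, and directory object names:

```sql
-- Hypothetical grants: the user HR_ANALYST queries the external table
-- MY_SCHEMA.SALES_EXT, whose cluster directory is exposed through the
-- directory object BDA1_CL_1_DIR. All names are placeholders.
GRANT READ ON my_schema.sales_ext TO hr_analyst;
GRANT READ ON DIRECTORY bda1_cl_1_dir TO hr_analyst;
```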
By default, a query returns no data if an error occurs while the value of a column is calculated. Processing continues after most errors, particularly those thrown while the column values are calculated.
Use the com.oracle.bigdata.erroropt and com.oracle.bigdata.overflow access parameters to control how errors are handled.
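The fragment below illustrates the general shape of these parameters. The JSON values shown are an assumption based on the documented form of these parameters; verify the exact syntax against the access parameters reference for your version:

```sql
-- Hypothetical ACCESS PARAMETERS fragment: set conversion errors in
-- SALE_DATE to NULL instead of rejecting the row, and truncate string
-- values that overflow the target column instead of raising an error.
ACCESS PARAMETERS (
  com.oracle.bigdata.tablename=default.hive_sales
  com.oracle.bigdata.erroropt=[{"action":"setnull", "col":["SALE_DATE"]}]
  com.oracle.bigdata.overflow={"action":"truncate"}
)
```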
This section explains the changes that the Oracle Big Data SQL installation makes to the Oracle Database system (which may or may not be an Oracle Exadata Machine).
The directory bigdata_config
contains configuration information that is common to all Hadoop clusters. This directory is located on the Oracle Database system under $ORACLE_HOME/bigdatasql
. The oracle
file system user (or whichever user owns the Oracle Database instance) owns bigdata_config
. The Oracle Database directory ORACLE_BIGDATA_CONFIG
points to bigdata_config
.
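To confirm which operating system path a directory object maps to, the DBA can query the data dictionary. For example:

```sql
-- Check the path behind ORACLE_BIGDATA_CONFIG (and any other
-- Big Data SQL directory objects) in the data dictionary.
SELECT directory_name, directory_path
  FROM all_directories
 WHERE directory_name LIKE 'ORACLE_BIGDATA%';
```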
The installation stores the following files in the bigdata_config directory under $ORACLE_HOME/bigdatasql:
bigdata.properties
bigdata-log4j.properties
The Oracle DBA can edit these configuration files as necessary.
The bigdata.properties
file in the common directory contains property-value pairs that define the Java class paths and native library paths required for accessing data in HDFS.
The following table describes the properties permitted in bigdata.properties. The properties marked Required must be set.

bigdata.properties

Property | Description
---|---
bigdata.cluster.default | The name of the default Hadoop cluster. The access driver uses this name when the access parameters do not specify a cluster. Required. Changing the default cluster name might break external tables that were created previously without an explicit cluster name.
bigdata.cluster.list | A comma-separated list of Hadoop cluster names. Optional.
java.classpath.hadoop | The Hadoop class path. Required.
java.classpath.hive | The Hive class path. Required.
java.classpath.oracle | The path to the Oracle JXAD Java JAR file. Required.
java.classpath.user | The path to user JAR files. Optional.
java.libjvm.file | The full file path to the JVM shared library (such as libjvm.so). Required.
java.options | A comma-separated list of options to pass to the JVM. Optional. This example sets the maximum heap size to 2 GB and verbose logging for Java Native Interface (JNI) calls: -Xmx2048m,-verbose=jni
LD_LIBRARY_PATH | A colon-separated (:) list of directory paths to search for the Hadoop native libraries. Recommended. If you set this option, then do not set java.library.path in java.options.
Example 2-1 shows a sample bigdata.properties
file.
Example 2-1 Sample bigdata.properties File
# bigdata.properties
#
# Copyright (c) 2014, Oracle and/or its affiliates. All rights reserved.
#
# NAME
#   bigdata.properties - Big Data Properties File
#
# DESCRIPTION
#   Properties file containing parameters for allowing access to Big Data
#   Fixed value properties can be added here
#
java.libjvm.file=$ORACLE_HOME/jdk/jre/lib/amd64/server/libjvm.so
java.classpath.oracle=$ORACLE_HOME/hadoopcore/jlib/*:$ORACLE_HOME/hadoop/jlib/hver-2/*:$ORACLE_HOME/dbjava/lib/*
java.classpath.hadoop=$HADOOP_HOME/*:$HADOOP_HOME/lib/*
java.classpath.hive=$HIVE_HOME/lib/*
LD_LIBRARY_PATH=$ORACLE_HOME/jdk/jre/lib
bigdata.cluster.default=hadoop_cl_1
The bigdata-log4j.properties
file in the common directory defines the logging behavior of queries against external tables in the Java code. Any log4j
properties are allowed in this file.
Example 2-2 shows a sample bigdata-log4j.properties
file with the relevant log4j
properties.
Example 2-2 Sample bigdata-log4j.properties File
# bigdata-log4j.properties
#
# Copyright (c) 2014, Oracle and/or its affiliates. All rights reserved.
#
# NAME
#   bigdata-log4j.properties - Big Data Logging Properties File
#
# DESCRIPTION
#   Properties file containing logging parameters for Big Data
#   Fixed value properties can be added here
#
bigsql.rootlogger=INFO,console
log4j.rootlogger=DEBUG, file
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{2}: %m%n
log4j.appender.file=org.apache.log4j.RollingFileAppender
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{2}: %m%n
log4j.logger.oracle.hadoop.sql=ALL, file
bigsql.log.dir=.
bigsql.log.file=bigsql.log
log4j.appender.file.File=$ORACLE_HOME/bigdatalogs/bigdata-log4j.log
The cluster directory contains configuration information for a Hadoop cluster. Each cluster that Oracle Database accesses using Oracle Big Data SQL has a cluster directory. This directory is located on the Oracle Database system under the bigdata_config directory. For example, a cluster named bda1_cl_1 would have a directory of the same name in the bigdata_config directory: $ORACLE_HOME/bigdatasql/bigdata_config/bda1_cl_1.
The cluster directory contains the client configuration files for accessing the cluster, such as the following:
bigdata.hosts
(not editable by customers)
core-site.xml
hdfs-site.xml
hive-site.xml
mapred-site.xml
(optional)
log4j
property files (such as hive-log4j.properties
)
A database directory object points to the cluster directory. Users who want to access the data in a cluster must have read access to the directory object.
On the Oracle database server, the oracle
user (or whatever user owns the Oracle Database installation directory) requires READ/WRITE access to the database directory that points to the log directory.
There must also be a corresponding oracle
user defined in the Hadoop cluster. The oracle
user requires READ access to HDFS on all DataNodes in the cluster.
In HDFS, the home directory (/user/oracle
) is created by default, but if oracle
is not the owner of the database instance on the database server, then you must create the home directory for the user manually.
For example, if "orclusr" (and its group "orclgrp") owned the database instance, then you would create the home directory /user/orclusr:

# hdfs dfs -mkdir /user/orclusr
# hdfs dfs -chown orclusr:orclgrp /user/orclusr