This chapter describes how to use Oracle SQL Connector for Hadoop Distributed File System (HDFS) to facilitate data access between Hadoop and Oracle Database.
This chapter contains the following sections:
Dropping Database Objects Created by Oracle SQL Connector for HDFS
More About External Tables Generated by the ExternalTable Tool
Using Oracle SQL Connector for HDFS, you can use Oracle Database to access and analyze data residing in Apache Hadoop in these formats:
Data Pump files in HDFS
Delimited text files in HDFS
Delimited text files in Apache Hive tables
For other file formats, such as JSON files, you can stage the input in Hive tables before using Oracle SQL Connector for HDFS. Partitioned Hive tables are supported, enabling you to represent a subset of Hive table partitions in Oracle Database, instead of the entire Hive table.
Oracle SQL Connector for HDFS uses external tables and database views to provide Oracle Database with read access to Hive tables, and to delimited text files and Data Pump files in HDFS. An external table is an Oracle Database object that identifies the location of data outside of a database. Oracle Database accesses the data by using the metadata provided when the external table was created. Oracle SQL Connector for HDFS creates database views over external tables to support access to partitioned Hive tables. By querying the external tables or views, you can access data stored in HDFS and Hive tables as if that data were stored in tables in an Oracle database.
To create these objects in Oracle Database, you use the ExternalTable
command-line tool provided with Oracle SQL Connector for HDFS. You provide ExternalTable
with information about the data source in Hadoop and about your schema in an Oracle Database. You provide this information either as options to the ExternalTable
command or in an XML file.
When the external table is ready, you can query the data the same as any other database table. You can query and join data in HDFS or a Hive table with other database-resident data.
You can also perform bulk loads of data into Oracle database tables using SQL. You may prefer that the data resides in an Oracle database—all of it or just a selection—if it is queried routinely.
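For example, a direct-path load can copy the HDFS data into a database table in a single statement. The following is a minimal sketch, assuming an external table named MOVIE_FACTS_EXT (the name used later in Example 2-1); the target table name MOVIE_FACTS is illustrative:

-- Create a compressed copy of the HDFS data inside the database
CREATE TABLE movie_facts COMPRESS AS
  SELECT * FROM movie_facts_ext;

-- Alternatively, append rows to an existing table with a direct-path insert
INSERT /*+ APPEND */ INTO movie_facts
  SELECT * FROM movie_facts_ext;
COMMIT;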
The following list identifies the basic steps that you take when using Oracle SQL Connector for HDFS.
Log in to a system where Oracle SQL Connector for HDFS is installed, which can be the Oracle Database system, a node in the Hadoop cluster, or a system set up as a remote client for the Hadoop cluster.
See "Installing and Configuring a Hadoop Client on the Oracle Database System."
The first time you use Oracle SQL Connector for HDFS, ensure that the software is configured.
See "Configuring Your System for Oracle SQL Connector for HDFS." You might also need to edit hdfs_stream
if your environment is unique. See "Installing Oracle SQL Connector for HDFS".
If you are connecting to a secure cluster, then run kinit
to authenticate yourself.
See "Using Oracle SQL Connector for HDFS on a Secure Hadoop Cluster."
Create an XML document describing the connections and the data source, unless you are providing these properties in the ExternalTable
command.
Create a shell script containing an ExternalTable
command.
Run the shell script.
If the job fails, then use the diagnostic messages in the output to identify and correct the error. Depending on how far the job progressed before failing, you may need to delete the table definition from the Oracle database before rerunning the script.
After the job succeeds, connect to Oracle Database as the owner of the external table. Query the table to ensure that the data is accessible.
If the data will be queried frequently, then you may want to load it into a database table to improve query performance. External tables do not have indexes or partitions.
If you want the data to be compressed as it loads into the table, then create the table with the COMPRESS
option.
To delete the Oracle Database objects created by Oracle SQL Connector for HDFS, use the -drop
command.
See "Dropping Database Objects Created by Oracle SQL Connector for HDFS".
Example 2-1 illustrates these steps.
Example 2-1 Accessing HDFS Data Files from Oracle Database
$ cat moviefact_hdfs.sh
# Add environment variables
export OSCH_HOME="/u01/connectors/orahdfs-3.0.0"

hadoop jar $OSCH_HOME/jlib/orahdfs.jar \
       oracle.hadoop.exttab.ExternalTable \
       -conf /home/oracle/movies/moviefact_hdfs.xml \
       -createTable

$ cat moviefact_hdfs.xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>oracle.hadoop.exttab.tableName</name>
    <value>MOVIE_FACTS_EXT</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.locationFileCount</name>
    <value>4</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.dataPaths</name>
    <value>/user/oracle/moviework/data/part*</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.fieldTerminator</name>
    <value>\u0009</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.defaultDirectory</name>
    <value>MOVIEDEMO_DIR</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.columnNames</name>
    <value>CUST_ID,MOVIE_ID,GENRE_ID,TIME_ID,RECOMMENDED,ACTIVITY_ID,RATING,SALES</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.colMap.TIME_ID.columnType</name>
    <value>TIMESTAMP</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.colMap.timestampMask</name>
    <value>YYYY-MM-DD:HH:MI:SS</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.colMap.RECOMMENDED.columnType</name>
    <value>NUMBER</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.colMap.ACTIVITY_ID.columnType</name>
    <value>NUMBER</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.colMap.RATING.columnType</name>
    <value>NUMBER</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.colMap.SALES.columnType</name>
    <value>NUMBER</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.sourceType</name>
    <value>text</value>
  </property>
  <property>
    <name>oracle.hadoop.connection.url</name>
    <value>jdbc:oracle:thin:@localhost:1521:orcl</value>
  </property>
  <property>
    <name>oracle.hadoop.connection.user</name>
    <value>MOVIEDEMO</value>
  </property>
</configuration>

$ sh moviefact_hdfs.sh
Oracle SQL Connector for HDFS Release 3.0.0 - Production
Copyright (c) 2011, 2014, Oracle and/or its affiliates. All rights reserved.
[Enter Database Password: password]
The create table command succeeded.

CREATE TABLE "MOVIEDEMO"."MOVIE_FACTS_EXT"
(
 "CUST_ID"        VARCHAR2(4000),
 "MOVIE_ID"       VARCHAR2(4000),
 "GENRE_ID"       VARCHAR2(4000),
 "TIME_ID"        TIMESTAMP(9),
 "RECOMMENDED"    NUMBER,
 "ACTIVITY_ID"    NUMBER,
 "RATING"         NUMBER,
 "SALES"          NUMBER
)
ORGANIZATION EXTERNAL
(
   TYPE ORACLE_LOADER
   DEFAULT DIRECTORY "MOVIEDEMO_DIR"
   ACCESS PARAMETERS
   (
     RECORDS DELIMITED BY 0X'0A'
     CHARACTERSET AL32UTF8
     PREPROCESSOR "OSCH_BIN_PATH":'hdfs_stream'
     FIELDS TERMINATED BY 0X'09'
     MISSING FIELD VALUES ARE NULL
     (
       "CUST_ID" CHAR(4000),
       "MOVIE_ID" CHAR(4000),
       "GENRE_ID" CHAR(4000),
       "TIME_ID" CHAR,
       "RECOMMENDED" CHAR,
       "ACTIVITY_ID" CHAR,
       "RATING" CHAR,
       "SALES" CHAR
     )
   )
   LOCATION
   (
     'osch-20141114064206-5250-1',
     'osch-20141114064206-5250-2',
     'osch-20141114064206-5250-3',
     'osch-20141114064206-5250-4'
   )
) PARALLEL REJECT LIMIT UNLIMITED;

The following location files were created.
osch-20141114064206-5250-1 contains 1 URI, 12754882 bytes

    12754882 hdfs://localhost.localdomain:8020/user/oracle/moviework/data/part-00001

osch-20141114064206-5250-2 contains 1 URI, 438 bytes

         438 hdfs://localhost.localdomain:8020/user/oracle/moviework/data/part-00002

osch-20141114064206-5250-3 contains 1 URI, 432 bytes

         432 hdfs://localhost.localdomain:8020/user/oracle/moviework/data/part-00003

osch-20141114064206-5250-4 contains 1 URI, 202 bytes

         202 hdfs://localhost.localdomain:8020/user/oracle/moviework/data/part-00004

$ sqlplus moviedemo

SQL*Plus: Release 12.1.0.1.0 Production on Fri Apr 18 09:24:18 2014

Copyright (c) 1982, 2013, Oracle. All rights reserved.

Enter password: password
Last Successful login time: Thu Apr 17 2014 18:42:01 -05:00

Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.1.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options

SQL> DESCRIBE movie_facts_ext;
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 CUST_ID                                            VARCHAR2(4000)
 MOVIE_ID                                           VARCHAR2(4000)
 GENRE_ID                                           VARCHAR2(4000)
 TIME_ID                                            TIMESTAMP(9)
 RECOMMENDED                                        NUMBER
 ACTIVITY_ID                                        NUMBER
 RATING                                             NUMBER
 SALES                                              NUMBER

SQL> CREATE TABLE movie_facts AS SELECT * FROM movie_facts_ext;

Table created.

SQL> SELECT movie_id, time_id, recommended, rating FROM movie_facts WHERE rownum < 5;

MOVIE_ID TIME_ID                          RECOMMENDED     RATING
-------- -------------------------------- ----------- ----------
205      03-DEC-10 03.14.54.000000000 AM            1          1
77       14-AUG-11 10.46.55.000000000 AM            1          3
116      24-NOV-11 05.43.00.000000000 AM            1          5
141      01-JAN-11 05.17.57.000000000 AM            1          4
You can run Oracle SQL Connector for HDFS on either the Oracle Database system or the Hadoop cluster:
For Hive sources, log in to either a node in the Hadoop cluster or a system set up as a Hadoop client for the cluster.
For text and Data Pump format files, log in to either the Oracle Database system or a node in the Hadoop cluster.
Oracle SQL Connector for HDFS requires additions to the HADOOP_CLASSPATH
environment variable on the system where you log in. Your system administrator may have set them up for you when creating your account, or may have left that task for you. See "Setting Up User Accounts on the Oracle Database System".
Setting up the environment variables:
Verify that HADOOP_CLASSPATH
includes the path to the JAR files for Oracle SQL Connector for HDFS:
path/orahdfs-3.0.0/jlib/*
If you are logged in to a Hadoop cluster with Hive data sources, then verify that HADOOP_CLASSPATH
also includes the Hive JAR files and conf directory. For example:
/usr/lib/hive/lib/* /etc/hive/conf
For your convenience, you can create an OSCH_HOME
environment variable. The following is the Bash command for setting it on Oracle Big Data Appliance:
$ export OSCH_HOME="/opt/oracle/orahdfs-3.0.0"
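For example, you might combine these settings in one place. The following is a minimal sketch for a Bash shell; the paths are illustrative and depend on where the software is installed on your system:

$ export OSCH_HOME="/opt/oracle/orahdfs-3.0.0"
$ export HADOOP_CLASSPATH="$HADOOP_CLASSPATH:$OSCH_HOME/jlib/*"

# Add the Hive JAR files and configuration directory only for Hive data sources:
$ export HADOOP_CLASSPATH="$HADOOP_CLASSPATH:/usr/lib/hive/lib/*:/etc/hive/conf"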
See Also:
"Oracle SQL Connector for Hadoop Distributed File System Setup" for instructions for installing the software and setting up user accounts on both systems.
OSCH_HOME
/doc/README.txt
for information about known problems with Oracle SQL Connector for HDFS.
Oracle SQL Connector for HDFS provides a command-line tool named ExternalTable
. This section describes the basic use of this tool. See "Creating External Tables" for the command syntax that is specific to your data source format.
The ExternalTable
tool uses the values of several properties to do the following tasks:
Create an external table
Populate the location files
Publish location files to an existing external table
List the location files
Describe an external table
You can specify these property values in an XML document or individually on the command line. See "Configuring Oracle SQL Connector for HDFS".
This is the full syntax of the ExternalTable
command-line tool, which is run using the hadoop
command:
hadoop jar OSCH_HOME/jlib/orahdfs.jar oracle.hadoop.exttab.ExternalTable \
[-conf config_file]... \
[-D property=value]... \
-createTable [--noexecute [--output filename.sql]]
  | -drop [--noexecute]
  | -describe
  | -publish [--noexecute]
  | -listlocations [--details]
  | -getDDL
You can either create the OSCH_HOME
environment variable or replace OSCH_HOME
in the command syntax with the full path to the installation directory for Oracle SQL Connector for HDFS. On Oracle Big Data Appliance, this directory is:
/opt/oracle/orahdfs-version
For example, you might run the ExternalTable
command-line tool with a command like this:
hadoop jar /opt/oracle/orahdfs-3.0.0/jlib/orahdfs.jar \
oracle.hadoop.exttab.ExternalTable \
. . .
Generic Options and User Commands
-conf config_file

Identifies the name of an XML configuration file containing properties needed by the command being executed. See "Configuring Oracle SQL Connector for HDFS."
-D property=value

Assigns a value to a specific property.
-createTable [--noexecute [--output filename.sql]]

Creates an external table definition and publishes the data URIs to the location files of the external table. The output report shows the DDL used to create the external table and lists the contents of the location files. Oracle SQL Connector for HDFS also checks the database to ensure that the required database directories exist and that you have the necessary permissions.
For partitioned Hive tables, Oracle SQL Connector for HDFS creates external tables, views, and a metadata table. See Table 2-2.
Use the --noexecute
option to see the execution plan of the command. The operation is not executed, but the report includes the details of the execution plan and any errors. The --output
option writes the table DDL from the -createTable
command to a file. Oracle recommends that you first execute a -createTable
command with --noexecute
.
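For example, the following sketch performs a dry run and saves the generated DDL to a file. The configuration file is the one from Example 2-1, and the output path is illustrative; OSCH_HOME is assumed to be set as described earlier.

$ hadoop jar $OSCH_HOME/jlib/orahdfs.jar \
      oracle.hadoop.exttab.ExternalTable \
      -conf /home/oracle/movies/moviefact_hdfs.xml \
      -createTable --noexecute --output /home/oracle/movies/moviefact_ddl.sql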
-drop [--noexecute]

Deletes one or more Oracle Database objects created by Oracle SQL Connector for HDFS to support a particular data source. Specify the metadata table name for partitioned Hive tables, or the external table name for all other data sources. An error occurs if you attempt to drop a table or view that Oracle SQL Connector for HDFS did not create.
Use the --noexecute
option to list the objects to be deleted.
-describe

Provides information about the Oracle Database objects created by Oracle SQL Connector for HDFS. Use this command instead of -getDDL
or -listLocations
.
-publish [--noexecute]

Publishes the data URIs to the location files of an existing external table. Use this command after adding new data files, so that the existing external table can access them.
Use the --noexecute
option to see the execution plan of the command. The operation is not executed, but the report shows the planned SQL ALTER TABLE
command and location files. The report also shows any errors.
Oracle recommends that you first execute a -publish
command with --noexecute
.
-listLocations [--details]

Shows the location file content as text. With the --details
option, this command provides a detailed listing. This command is deprecated in release 3.0. Use -describe instead.
-getDDL

Prints the table definition of an existing external table. This command is deprecated in release 3.0. Use -describe instead.
See Also:
"Syntax Conventions"You can create external tables automatically using the ExternalTable
tool provided in Oracle SQL Connector for HDFS.
To create an external table using the ExternalTable
tool, follow the instructions for your data source:
When the ExternalTable -createTable
command finishes executing, the external table is ready for use. ExternalTable
also manages the location files for the external table. See "Location File Management."
To create external tables manually, follow the instructions in "Creating External Tables in SQL."
ExternalTable Syntax for -createTable
Use the following syntax to create an external table and populate its location files:
hadoop jar OSCH_HOME/jlib/orahdfs.jar oracle.hadoop.exttab.ExternalTable \
[-conf config_file]... \
[-D property=value]... \
-createTable [--noexecute]
See Also:
"ExternalTable Command-Line Tool Syntax"Oracle SQL Connector for HDFS supports only Data Pump files produced by Oracle Loader for Hadoop, and does not support generic Data Pump files produced by Oracle Utilities.
Oracle SQL Connector for HDFS creates the external table definition for Data Pump files by using the metadata from the Data Pump file header. It uses the ORACLE_LOADER
access driver with the preprocessor
access parameter. It also uses a special access parameter named EXTERNAL VARIABLE DATA
, which enables ORACLE_LOADER
to read the Data Pump format files generated by Oracle Loader for Hadoop.
To delete the external tables and location files created by Oracle SQL Connector for HDFS, use the -drop
command. See "Dropping Database Objects Created by Oracle SQL Connector for HDFS".
Note:
Oracle SQL Connector for HDFS requires a patch to Oracle Database 11.2.0.2 before the connector can access Data Pump files produced by Oracle Loader for Hadoop. To download this patch, go to http://support.oracle.com
and search for bug 14557588.
Release 11.2.0.3 and later releases do not require this patch.
These properties are required:
oracle.hadoop.exttab.sourceType=datapump
See "Configuring Oracle SQL Connector for HDFS" for descriptions of the properties used for this data source.
Example 2-2 is an XML template containing the properties that describe a Data Pump file. To use the template, cut and paste it into a text file, enter the appropriate values to describe your Data Pump file, and delete any optional properties that you do not need. For more information about using XML templates, see "Creating a Configuration File."
Example 2-2 XML Template with Properties for a Data Pump Format File
<?xml version="1.0"?> <!-- Required Properties --> <configuration> <property> <name>oracle.hadoop.exttab.tableName</name> <value>value</value> </property> <property> <name>oracle.hadoop.exttab.defaultDirectory</name> <value>value</value> </property> <property> <name>oracle.hadoop.exttab.dataPaths</name> <value>value</value> </property> <property> <name>oracle.hadoop.exttab.sourceType</name> <value>datapump</value> </property> <property> <name>oracle.hadoop.connection.url</name> <value>value</value> </property> <property> <name>oracle.hadoop.connection.user</name> <value>value</value> </property> <!-- Optional Properties --> <property> <name>oracle.hadoop.exttab.logDirectory</name> <value>value</value> </property> </configuration>
Example 2-3 creates an external table named SALES_DP_XTAB
to read Data Pump files.
Example 2-3 Defining an External Table for Data Pump Format Files
Log in as the operating system user that Oracle Database runs under (typically the oracle
user), and create a file-system directory. For Oracle RAC, you must create a clusterwide directory on a distributed file system.
$ mkdir /data/sales_dp_dir
Create a database directory and grant read and write access to it:
$ sqlplus / as sysdba
SQL> CREATE OR REPLACE DIRECTORY sales_dp_dir AS '/data/sales_dp_dir';
SQL> GRANT READ, WRITE ON DIRECTORY sales_dp_dir TO scott;
Create the external table:
$ export OSCH_HOME="/opt/oracle/orahdfs-3.0.0"
$ export HADOOP_CLASSPATH="$HADOOP_CLASSPATH:$OSCH_HOME/jlib/*"
$ hadoop jar $OSCH_HOME/jlib/orahdfs.jar \
oracle.hadoop.exttab.ExternalTable \
-D oracle.hadoop.exttab.tableName=SALES_DP_XTAB \
-D oracle.hadoop.exttab.sourceType=datapump \
-D oracle.hadoop.exttab.dataPaths=hdfs:///user/scott/olh_sales_dpoutput/ \
-D oracle.hadoop.exttab.defaultDirectory=SALES_DP_DIR \
-D oracle.hadoop.connection.url=jdbc:oracle:thin:@//myhost:1521/myservicename \
-D oracle.hadoop.connection.user=SCOTT \
-createTable
Oracle SQL Connector for HDFS creates the external table definition from a Hive table by contacting the Hive metastore client to retrieve information about the table columns and the location of the table data. In addition, the Hive table data paths are published to the location files of the Oracle external table.
To read Hive table metadata, Oracle SQL Connector for HDFS requires that the Hive JAR files are included in the HADOOP_CLASSPATH
variable. Oracle SQL Connector for HDFS must be installed and running on a computer with a working Hive client.
Ensure that you add the Hive configuration directory to the HADOOP_CLASSPATH
environment variable. You must have a correctly functioning Hive client.
For Hive managed tables, the data paths come from the warehouse directory.
For Hive external tables, the data paths from an external location in HDFS are published to the location files of the Oracle external table. Hive external tables can have no data, because Hive does not check if the external location is defined when the table is created. If the Hive table is empty, then one location file is published with just a header and no data URIs.
The Oracle external table is not a "live" Hive table. After changes are made to a Hive table, you must use the ExternalTable
tool to drop the existing external table and create a new one.
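For example, assuming a configuration file that describes the Hive source (the file name here is hypothetical) and OSCH_HOME set as described earlier, the refresh is a drop followed by a create:

$ hadoop jar $OSCH_HOME/jlib/orahdfs.jar oracle.hadoop.exttab.ExternalTable \
      -conf /home/oracle/sales_hive.xml -drop
$ hadoop jar $OSCH_HOME/jlib/orahdfs.jar oracle.hadoop.exttab.ExternalTable \
      -conf /home/oracle/sales_hive.xml -createTable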
To delete the external tables and location files created by Oracle SQL Connector for HDFS, use the -drop
command. See "Dropping Database Objects Created by Oracle SQL Connector for HDFS".
Oracle SQL Connector for HDFS supports nonpartitioned Hive tables that are defined using ROW FORMAT DELIMITED
and FILE FORMAT TEXTFILE
clauses. Both Hive-managed tables and Hive external tables are supported.
Oracle SQL Connector for HDFS also supports partitioned tables. See "Creating External Tables from Partitioned Hive Tables".
Hive tables can be either bucketed or not bucketed. All primitive types from Hive 0.10.0 are supported.
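As an illustration, the following Hive table definition satisfies these requirements; the table and column names are hypothetical, and STORED AS TEXTFILE is Hive's clause for a TEXTFILE table:

hive> CREATE TABLE sales_country_us (
          cust_id   INT,
          sale_date TIMESTAMP,
          amount    DOUBLE
      )
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
      STORED AS TEXTFILE;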
Table 2-1 shows the default data-type mappings between Hive and Oracle. To change the data type of the target columns created in the Oracle external table, set the oracle.hadoop.exttab.hive.columnType.*
properties listed under "Optional Properties."
Table 2-1 Hive Data Type Mappings
Data Type of Source Hive Column | Default Data Type of Target Oracle Column |
---|---|
These properties are required for Hive table sources:
See "Configuring Oracle SQL Connector for HDFS" for descriptions of the properties used for this data source.
These properties are optional for Hive table sources:
Example 2-4 is an XML template containing the properties that describe a Hive table. To use the template, cut and paste it into a text file, enter the appropriate values to describe your Hive table, and delete any optional properties that you do not need. For more information about using XML templates, see "Creating a Configuration File."
Example 2-4 XML Template with Properties for a Hive Table
<?xml version="1.0"?> <!-- Required Properties --> <configuration> <property> <name>oracle.hadoop.exttab.tableName</name> <value>value</value> </property> <property> <name>oracle.hadoop.exttab.defaultDirectory</name> <value>value</value> </property> <property> <name>oracle.hadoop.exttab.sourceType</name> <value>hive</value> </property> <property> <name>oracle.hadoop.exttab.hive.partitionFilter</name> <value>value</value> </property> <property> <name>oracle.hadoop.exttab.hive.tableName</name> <value>value</value> </property> <property> <name>oracle.hadoop.exttab.hive.databaseName</name> <value>value</value> </property> <property> <name>oracle.hadoop.connection.url</name> <value>value</value> </property> <property> <name>oracle.hadoop.connection.user</name> <value>value</value> </property> <!-- Optional Properties --> <property> <name>oracle.hadoop.exttab.locationFileCount</name> <value>value</value> </property> <property> <name>oracle.hadoop.exttab.hive.columnType.TYPE</name> <value>value</value> </property> </configuration>
Example 2-5 creates an external table named SALES_HIVE_XTAB
to read data from a Hive table. The example defines all the properties on the command line instead of in an XML file.
Example 2-5 Defining an External Table for a Hive Table
Log in as the operating system user that Oracle Database runs under (typically the oracle
user), and create a file-system directory:
$ mkdir /data/sales_hive_dir
Create a database directory and grant read and write access to it:
$ sqlplus / as sysdba
SQL> CREATE OR REPLACE DIRECTORY sales_hive_dir AS '/data/sales_hive_dir';
SQL> GRANT READ, WRITE ON DIRECTORY sales_hive_dir TO scott;
Create the external table:
$ export OSCH_HOME="/opt/oracle/orahdfs-3.0.0"
$ export HADOOP_CLASSPATH="$HADOOP_CLASSPATH:$OSCH_HOME/jlib/*:/usr/lib/hive/lib/*:/etc/hive/conf"
$ hadoop jar $OSCH_HOME/jlib/orahdfs.jar \
oracle.hadoop.exttab.ExternalTable \
-D oracle.hadoop.exttab.tableName=SALES_HIVE_XTAB \
-D oracle.hadoop.exttab.sourceType=hive \
-D oracle.hadoop.exttab.locationFileCount=2 \
-D oracle.hadoop.exttab.hive.tableName=sales_country_us \
-D oracle.hadoop.exttab.hive.databaseName=salesdb \
-D oracle.hadoop.exttab.defaultDirectory=SALES_HIVE_DIR \
-D oracle.hadoop.connection.url=jdbc:oracle:thin:@//myhost:1521/myservicename \
-D oracle.hadoop.connection.user=SCOTT \
-createTable
Oracle SQL Connector for HDFS supports partitioned Hive tables, enabling you to query a single partition, a range of partitions, or all partitions. You can represent all Hive partitions or a subset of them in Oracle Database.
See Also:
"Creating External Tables from Hive Tables" for required properties, data type mappings, and other details applicable to all Hive table access using Oracle SQL Connector for HDFS.To support a partitioned Hive table, Oracle SQL Connector for HDFS creates the objects described in Table 2-2.
Table 2-2 Oracle Database Objects for Supporting a Partitioned Hive Table
Database Object | Description | Naming Convention |
---|---|---|
External Tables |
One for each Hive partition |
For example, |
Views |
One for each external table. Used for querying the Hive data. |
For example, |
Metadata Table |
One for the Hive table. Identifies all external tables and views associated with a particular Hive table. Specify this table when creating, describing, or dropping these database objects. |
For example, |
For example, if a Hive table comprises five partitions, then Oracle SQL Connector for HDFS creates five external tables, five views, and one metadata table in Oracle Database.
To drop the objects described in Table 2-2 and the location files, use the -drop
command. See "Dropping Database Objects Created by Oracle SQL Connector for HDFS".
The metadata table provides critical information about how to query the Hive table. Table 2-3 describes the columns of a metadata table.
Table 2-3 Metadata Table Columns
Column | Description |
---|---|
VIEW_NAME |
The view in Oracle Database used to access a partition of a Hive table. |
EXT_TABLE_NAME |
The external table in Oracle Database used to access the data in a Hive partition. |
HIVE_TABLE_NAME |
The partitioned Hive table being accessed through Oracle Database. |
HIVE_DB_NAME |
The Hive database where the table resides. |
|
The Hive partition filter used to select a subset of partitions for access by Oracle Database. A |
Partition Columns |
Each column used to partition the Hive table has a separate column in the metadata table. For example, the metadata table has columns for |
The following SELECT
statement queries a metadata table named HIVE_SALES_DATA
:
SQL> SELECT view_name, ext_table_name, hive_table_name,
            hive_db_name, country, city
     FROM hive_sales_data
     WHERE state = 'TEXAS';
The results of the query identify three views with data from cities in Texas:
VIEW_NAME          EXT_TABLE_NAME          HIVE_TABLE_NAME   HIVE_DB_NAME   COUNTRY   CITY
------------------------------------------------------------------------------------------
HIVE_SALES_DATA_1  OSCHHIVE_SALES_DATA_1   hive_sales_data   db_sales       US        AUSTIN
HIVE_SALES_DATA_2  OSCHHIVE_SALES_DATA_2   hive_sales_data   db_sales       US        HOUSTON
HIVE_SALES_DATA_3  OSCHHIVE_SALES_DATA_3   hive_sales_data   db_sales       US        DALLAS
To facilitate querying, you can create UNION ALL
views over the individual partition views. Use the mkhive_unionall_view.sql
script, which is provided in the OSCH_HOME/example/sql
directory. To maintain performance, do not create UNION ALL
views over more than 50 to 100 views (depending on their size).
To use mkhive_unionall_view.sql
, use the following syntax:
@mkhive_unionall_view[.sql] table schema view predicate
MKHIVE_UNIONALL_VIEW Script Parameters
table

The name of the metadata table in Oracle Database that represents a partitioned Hive table. Required.
schema

The owner of the metadata table. Optional; defaults to your schema.
view

The name of the UNION ALL view created by the script. Optional; defaults to table_ua.
predicate

A WHERE condition used to select the partitions in the Hive table to include in the UNION ALL view. Optional; defaults to all partitions.
Example 2-6 Union All Views for Partitioned Hive Tables
The following example creates a UNION ALL
view named HIVE_SALES_DATA_UA
, which accesses all partitions listed in the HIVE_SALES_DATA
metadata table:
SQL> @mkhive_unionall_view.sql HIVE_SALES_DATA null null null
This example creates a UNION ALL
view named ALL_SALES
, which accesses all partitions listed in the HIVE_SALES_DATA
metadata table:
SQL> @mkhive_unionall_view.sql HIVE_SALES_DATA null ALL_SALES null
The next example creates a UNION ALL
view named TEXAS_SALES_DATA
, which accesses the rows of all partitions where STATE = 'TEXAS'
.
SQL> @mkhive_unionall_view.sql HIVE_SALES_DATA null TEXAS_SALES_DATA '(STATE = ''''TEXAS'''')'
The names of these objects are derived from the value of the oracle.hadoop.exttab.tableName property. Oracle SQL Connector for HDFS generates the external table names using the convention OSCHtable_name_n. See Table 2-2.

If these objects already exist, then use the -drop command to drop the existing tables and views, and then retry the -createTable command. If this solution fails, then you might have "dangling" objects. See "Dropping Dangling Objects".

Always use Oracle SQL Connector for HDFS commands to manage objects created by the connector to support partitioned Hive tables. Dangling objects are caused when you use the SQL DROP TABLE command to drop a metadata table instead of the -drop command. If you are unable to drop the external tables and views for a partitioned Hive table, then they are dangling objects.
Notice the schema and table names in the error message generated when you attempted to drop the objects, and use them in the following procedure.
To drop dangling database objects:
Open a SQL session with Oracle Database, and connect as the owner of the dangling objects.
Identify the location files of the external table by querying the ALL_EXTERNAL_LOCATIONS
and ALL_EXTERNAL_TABLES
data dictionary views:
SELECT a.table_name, a.directory_name, a.location
FROM all_external_locations a, all_external_tables b
WHERE a.table_name = b.table_name
  AND a.table_name LIKE 'OSCHtable%'
  AND a.owner = 'schema';
In the LIKE
clause of the previous syntax, replace table and schema with the appropriate values.
In the output, the location file names have an osch-
prefix, such as osch-20140408014604-175-1
.
Identify the external tables by querying the ALL_EXTERNAL_TABLES
data dictionary view:
SELECT table_name
FROM all_external_tables
WHERE table_name LIKE 'OSCHtable%'
  AND owner = 'schema';
Identify the database views by querying the ALL_VIEWS
data dictionary view:
SELECT view_name FROM all_views WHERE view_name LIKE 'table%' AND owner='schema';
Inspect the tables, views, and location files to verify that they are not needed, using commands like the following:
DESCRIBE schema.table;
SELECT * FROM schema.table;

DESCRIBE schema.view;
SELECT * FROM schema.view;
Delete the location files, tables, and views that are not needed, using commands like the following:
EXECUTE utl_file.fremove('directory', 'location_file');
DROP TABLE schema.table;
DROP VIEW schema.view;
Oracle SQL Connector for HDFS creates the external table definition for delimited text files using configuration properties that specify the number of columns, the text delimiter, and optionally, the external table column names. By default, all text columns in the external table are VARCHAR2
. If column names are not provided, they default to C1 to Cn, where n is the number of columns specified by the oracle.hadoop.exttab.columnCount
property.
All text data sources are automatically mapped to VARCHAR2(4000)
. To change the data type of the target columns created in the Oracle external table, set the oracle.hadoop.exttab.colMap.*
properties listed under "Optional Properties."
These properties are required for delimited text sources:
See "Configuring Oracle SQL Connector for HDFS" for descriptions of the properties used for this data source.
These properties are optional for delimited text sources:
Example 2-7 is an XML template containing all the properties that describe a delimited text file. To use the template, cut and paste it into a text file, enter the appropriate values to describe your data files, and delete any optional properties that you do not need. For more information about using XML templates, see "Creating a Configuration File."
Example 2-7 XML Template with Properties for a Delimited Text File
<?xml version="1.0"?> <!-- Required Properties --> <configuration> <property> <name>oracle.hadoop.exttab.tableName</name> <value>value</value> </property> <property> <name>oracle.hadoop.exttab.defaultDirectory</name> <value>value</value> </property> <property> <name>oracle.hadoop.exttab.dataPaths</name> <value>value</value> </property> <!-- Use either columnCount or columnNames --> <property> <name>oracle.hadoop.exttab.columnCount</name> <value>value</value> </property> <property> <name>oracle.hadoop.exttab.columnNames</name> <value>value</value> </property> <property> <name>oracle.hadoop.connection.url</name> <value>value</value> </property> <property> <name>oracle.hadoop.connection.user</name> <value>value</value> </property> <!-- Optional Properties --> <property> <name>oracle.hadoop.exttab.colMap.TYPE</name> <value>value</value> </property> <property> <name>oracle.hadoop.exttab.recordDelimiter</name> <value>value</value> </property> <property> <name>oracle.hadoop.exttab.fieldTerminator</name> <value>value</value> </property> <property> <name>oracle.hadoop.exttab.initialFieldEncloser</name> <value>value</value> </property> <property> <name>oracle.hadoop.exttab.trailingFieldEncloser</name> <value>value</value> </property> <property> <name>oracle.hadoop.exttab.locationFileCount</name> <value>value</value> </property> </configuration>
Example 2-8 creates an external table named SALES_DT_XTAB
from delimited text files.
Example 2-8 Defining an External Table for Delimited Text Files
Log in as the operating system user that Oracle Database runs under (typically the oracle
user), and create a file-system directory:
$ mkdir /data/sales_dt_dir
Create a database directory and grant read and write access to it:
$ sqlplus / as sysdba
SQL> CREATE OR REPLACE DIRECTORY sales_dt_dir AS '/data/sales_dt_dir';
SQL> GRANT READ, WRITE ON DIRECTORY sales_dt_dir TO scott;
Create the external table:
$ export OSCH_HOME="/opt/oracle/orahdfs-3.0.0"
$ export HADOOP_CLASSPATH="$HADOOP_CLASSPATH:$OSCH_HOME/jlib/*"
$ hadoop jar $OSCH_HOME/jlib/orahdfs.jar \
oracle.hadoop.exttab.ExternalTable \
-D oracle.hadoop.exttab.tableName=SALES_DT_XTAB \
-D oracle.hadoop.exttab.locationFileCount=2 \
-D oracle.hadoop.exttab.dataPaths="hdfs:///user/scott/olh_sales/*.dat" \
-D oracle.hadoop.exttab.columnCount=10 \
-D oracle.hadoop.exttab.defaultDirectory=SALES_DT_DIR \
-D oracle.hadoop.connection.url=jdbc:oracle:thin:@//myhost:1521/myservicename \
-D oracle.hadoop.connection.user=SCOTT \
-createTable
You can create an external table manually for Oracle SQL Connector for HDFS. For example, the following procedure enables you to use external table syntax that is not exposed by the ExternalTable -createTable
command.
Additional syntax might not be supported for Data Pump format files.
To create an external table manually:
Use the -createTable --noexecute
command to generate the external table DDL.
Make whatever changes are needed to the DDL.
Run the DDL from Step 2 to create the table definition in the Oracle database.
Use the ExternalTable -publish
command to publish the data URIs to the location files of the external table.
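Putting the steps together, a session might look like the following sketch; the configuration and DDL file names are hypothetical, and OSCH_HOME is assumed to be set:

$ hadoop jar $OSCH_HOME/jlib/orahdfs.jar oracle.hadoop.exttab.ExternalTable \
      -conf /home/oracle/sales_text.xml \
      -createTable --noexecute --output /home/oracle/sales_text_ddl.sql

$ # Edit /home/oracle/sales_text_ddl.sql as needed, then run it in SQL*Plus:
$ sqlplus scott @/home/oracle/sales_text_ddl.sql

$ hadoop jar $OSCH_HOME/jlib/orahdfs.jar oracle.hadoop.exttab.ExternalTable \
      -conf /home/oracle/sales_text.xml \
      -publish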
The -createTable
command creates the metadata in Oracle Database for delimited text and Data Pump sources, and populates the location files with the Universal Resource Identifiers (URIs) of the data files in HDFS.

You might publish the URIs as a separate step from creating the external table in cases like these:
You want to publish new data into an already existing external table.
You created the external table manually instead of using the ExternalTable
tool.
In both cases, you can use ExternalTable
with the -publish
command to populate the external table location files with the URIs of the data files in HDFS. See "Location File Management".
Note:
The -publish option is supported for delimited text and Data Pump sources. It is not supported for Hive sources. Use the -drop and -createTable commands of the ExternalTable tool for Hive sources.

ExternalTable Syntax for -publish

hadoop jar OSCH_HOME/jlib/orahdfs.jar \
oracle.hadoop.exttab.ExternalTable \
[-conf config_file]... \
[-D property=value]... \
-publish [--noexecute]
See Also:
"ExternalTable Command-Line Tool Syntax"Example 2-9 sets HADOOP_CLASSPATH
and publishes the HDFS data paths to the external table created in Example 2-3. See "Configuring Your System for Oracle SQL Connector for HDFS" for more information about setting this environment variable.
Example 2-9 Publishing HDFS Data Paths to an External Table for Data Pump Format Files
This example uses the Bash shell.
$ export HADOOP_CLASSPATH="OSCH_HOME/jlib/*" $ hadoop jar OSCH_HOME/jlib/orahdfs.jar oracle.hadoop.exttab.ExternalTable \ -D oracle.hadoop.exttab.tableName=SALES_DP_XTAB \ -D oracle.hadoop.exttab.sourceType=datapump \ -D oracle.hadoop.exttab.dataPaths=hdfs:/user/scott/data/ \ -D oracle.hadoop.connection.url=jdbc:oracle:thin:@//myhost:1521/myservicename \ -D oracle.hadoop.connection.user=scott -publish
In this example:
OSCH_HOME
is the full path to the Oracle SQL Connector for HDFS installation directory.
SALES_DP_XTAB
is the external table created in Example 2-3.
hdfs:/user/scott/data/
is the location of the HDFS data.
@myhost:1521
is the database connection string.
The -describe
command is a debugging and diagnostic utility that prints the definition of an existing external table. It also enables you to see the location file metadata and contents. You can use this command to verify the integrity of the location files of an Oracle external table.
These properties are required to use this command:
The JDBC connection properties; see "Connection Properties."
hadoop jar OSCH_HOME/jlib/orahdfs.jar \
oracle.hadoop.exttab.ExternalTable \
[-conf config_file]... \
[-D property=value]... \
-describe
See Also:
"ExternalTable Command-Line Tool Syntax"Example 2-10 shows the command syntax to describe the external tables and location files associated with SALES_DP_XTAB
.
Example 2-10 Exploring External Tables and Location Files
$ export HADOOP_CLASSPATH="OSCH_HOME/jlib/*" $ hadoop jar OSCH_HOME/jlib/orahdfs.jar oracle.hadoop.exttab.ExternalTable \ -D oracle.hadoop.exttab.tableName=SALES_DP_XTAB \ -D oracle.hadoop.connection.url=jdbc:oracle:thin:@//myhost:1521/myservicename \ -D oracle.hadoop.connection.user=scott -describe
The -drop
command deletes the database objects created by Oracle SQL Connector for HDFS. These objects include external tables, location files, and views. When dropping objects associated with a partitioned Hive table, problems arise if you delete objects manually, as described in "Dropping Dangling Objects". However, -drop
also discards single external tables and location files that support other data source types.
The -drop
command only deletes objects created by Oracle SQL Connector for HDFS.
These properties are required to use this command:
The JDBC connection properties; see "Connection Properties."
hadoop jar OSCH_HOME/jlib/orahdfs.jar \
oracle.hadoop.exttab.ExternalTable \
[-conf config_file]... \
[-D property=value]... \
-drop
See Also:
"ExternalTable Command-Line Tool Syntax"Example 2-10 shows the command syntax to drop the database objects associated with SALES_DP_XTAB
.
Example 2-11 Dropping Database Objects
$ export HADOOP_CLASSPATH="OSCH_HOME/jlib/*" $ hadoop jar OSCH_HOME/jlib/orahdfs.jar oracle.hadoop.exttab.ExternalTable \ -D oracle.hadoop.exttab.tableName=SALES_DP_XTAB \ -D oracle.hadoop.connection.url=jdbc:oracle:thin:@//myhost:1521/myservicename \ -D oracle.hadoop.connection.user=scott -drop
Because external tables are used to access data, all of the features and limitations of external tables apply. Queries are executed in parallel with automatic load balancing. However, update, insert, and delete operations are not allowed and indexes cannot be created on external tables. When an external table is accessed, a full table scan is always performed.
Oracle SQL Connector for HDFS uses the ORACLE_LOADER
access driver. The hdfs_stream
preprocessor script (provided with Oracle SQL Connector for HDFS) modifies the input data to a format that ORACLE_LOADER
can process.
See Also:
Oracle Database Administrator's Guide for information about external tables
Oracle Database Utilities for more information about external tables, performance hints, and restrictions when you are using the ORACLE_LOADER
access driver.
Oracle SQL Connector for HDFS uses default data type mappings to create columns in an Oracle external table with the appropriate data types for the Hive and text sources. You can override these defaults by setting various configuration properties, for either all columns or a specific column.
For example, a field in a text file might contain a timestamp. By default, the field is mapped to a VARCHAR2
column. However, you can specify a TIMESTAMP
column and provide a datetime mask to cast the values correctly into the TIMESTAMP
data type. The TIMESTAMP
data type supports time-based queries and analysis that are unavailable when the data is presented as text.
Text sources are mapped to VARCHAR2
columns, and Hive columns are mapped to columns with the closest equivalent Oracle data type. Table 2-1 shows the default mappings.
The following properties apply to all columns in the external table. For Hive sources, these property settings override the oracle.hadoop.exttab.hive.*
property settings.
The following properties apply to only one column, whose name is the column_name part of the property name. These property settings override all other settings.
The following properties create an external table in which all columns are the default VARCHAR2
data type:
oracle.hadoop.exttab.tableName=MOVIE_FACT_EXT_TAB_TXT
oracle.hadoop.exttab.columnNames=CUST_ID,MOVIE_ID,GENRE_ID,TIME_ID,RECOMMENDED,ACTIVITY_ID,RATING,SALES
In this example, the following properties are set to override the data type of several columns:
oracle.hadoop.exttab.colMap.TIME_ID.columnType=TIMESTAMP
oracle.hadoop.exttab.colMap.RECOMMENDED.columnType=NUMBER
oracle.hadoop.exttab.colMap.ACTIVITY_ID.columnType=NUMBER
oracle.hadoop.exttab.colMap.RATING.columnType=NUMBER
oracle.hadoop.exttab.colMap.SALES.columnType=NUMBER
Oracle SQL Connector for HDFS creates an external table with the specified data types:
SQL> DESCRIBE movie_facts_ext
Name Null? Type
----------------------------------------- -------- ----------------------------
CUST_ID VARCHAR2(4000)
MOVIE_ID VARCHAR2(4000)
GENRE_ID VARCHAR2(4000)
TIME_ID TIMESTAMP(9)
RECOMMENDED NUMBER
ACTIVITY_ID NUMBER
RATING NUMBER
SALES NUMBER
The next example adds the following property settings to change the length of the VARCHAR2
columns:
oracle.hadoop.exttab.colMap.CUST_ID.columnLength=12 oracle.hadoop.exttab.colMap.MOVIE_ID.columnLength=12 oracle.hadoop.exttab.colMap.GENRE_ID.columnLength=12
All columns now have custom data types:
SQL> DESCRIBE movie_facts_ext
Name Null? Type
----------------------------------------- -------- ----------------------------
CUST_ID VARCHAR2(12)
MOVIE_ID VARCHAR2(12)
GENRE_ID VARCHAR2(12)
TIME_ID TIMESTAMP(9)
RECOMMENDED NUMBER
ACTIVITY_ID NUMBER
RATING NUMBER
SALES NUMBER
A location file is a file specified in the location clause of the external table. Oracle SQL Connector for HDFS creates location files that contain only the Universal Resource Identifiers (URIs) of the data files. A data file contains the data stored in HDFS.
To enable parallel processing with external tables, you must specify multiple files in the location clause of the external table. The number of files determines the number of child processes started by the external table during a table read, which is known as the degree of parallelism or DOP.
Ideally, you can decide to run at a particular degree of parallelism and create a number of location files that are a multiple of the degree of parallelism, as described in the following procedure.
To set up parallel processing for maximum performance:
Identify the maximum DOP that your Oracle DBA will permit you to use when running Oracle SQL Connector for HDFS.
When loading a huge amount of data into an Oracle database, you should also work with the DBA to identify a time when the maximum resources are available.
Create a number of location files that is a small multiple of the DOP. For example, if the DOP is 8, then you might create 8, 16, 24, or 32 location files.
Create a number of HDFS files that are about the same size and a multiple of the number of location files. For example, if you have 32 location files, then you might create 128, 1280, or more HDFS files, depending on the amount of data and the minimum HDFS file size.
Set the DOP for the data load, using either the ALTER SESSION
command or hints in the SQL SELECT
statement.
This example sets the DOP to 8 using ALTER SESSION
:
ALTER SESSION FORCE PARALLEL DML PARALLEL 8;
ALTER SESSION FORCE PARALLEL QUERY PARALLEL 8;
The next example sets the DOP to 8 using the PARALLEL
hint:
INSERT /*+ parallel(my_db_table,8) */ INTO my_db_table
  SELECT /*+ parallel(my_hdfs_external_table,8) */ *
  FROM my_hdfs_external_table;
An APPEND
hint in the SQL INSERT
statement can also help improve performance.
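For example, combining the hints with parallel DML enabled gives a direct-path, parallel load. This sketch reuses the table names from the previous example:

ALTER SESSION ENABLE PARALLEL DML;

INSERT /*+ APPEND PARALLEL(my_db_table,8) */ INTO my_db_table
  SELECT /*+ PARALLEL(my_hdfs_external_table,8) */ *
  FROM my_hdfs_external_table;

COMMIT;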
The Oracle SQL Connector for HDFS command-line tool, ExternalTable, creates an external table and publishes the HDFS URI information to location files. The external table location files are stored in the directory specified by the oracle.hadoop.exttab.defaultDirectory property. For an Oracle RAC database, this directory must reside on a distributed file system that is accessible to each database server.
ExternalTable
manages the location files of the external table, which involves the following operations:
Generating new location files in the database directory after checking for name conflicts
Deleting existing location files in the database directory as necessary
Publishing data URIs to new location files
Altering the LOCATION
clause of the external table to match the new location files
Location file management for the supported data sources is described in the following topics.
The ORACLE_LOADER
access driver is required to access Data Pump files. The driver requires that each location file corresponds to a single Data Pump file in HDFS. Empty location files are not allowed, and so the number of location files in the external table must exactly match the number of data files in HDFS.
Oracle SQL Connector for HDFS automatically takes over location file management and ensures that the number of location files in the external table equals the number of Data Pump files in HDFS.
The ORACLE_LOADER
access driver has no limitation on the number of location files. Each location file can correspond to one or more data files in HDFS. The number of location files for the external table is suggested by the oracle.hadoop.exttab.locationFileCount
configuration property.
This is the format of a location file name:
osch-timestamp-number-n
In this syntax:
timestamp
has the format yyyyMMddhhmmss
, for example, 20121017103941 for October 17, 2012, at 10:39:41.
number
is a random number used to prevent location file name conflicts among different tables.
n
is an index used to prevent name conflicts between location files for the same table.
For example, osch-20121017103941-6807-1.
You can pass configuration properties to the ExternalTable
tool on the command line with the -D
option, or you can create a configuration file and pass it on the command line with the -conf
option. These options must precede the command to be executed.
For example, this command uses a configuration file named example.xml
:
hadoop jar OSCH_HOME/jlib/orahdfs.jar \
oracle.hadoop.exttab.ExternalTable \
-conf /home/oracle/example.xml \
-createTable
See "ExternalTable Command-Line Tool Syntax".
A configuration file is an XML document with a very simple structure as follows:
<?xml version="1.0"?> <configuration> <property> <name>property</name> <value>value</value> </property> . . . </configuration>
Example 2-12 shows a configuration file. See "Oracle SQL Connector for HDFS Configuration Property Reference" for descriptions of these properties.
Example 2-12 Configuration File for Oracle SQL Connector for HDFS
<?xml version="1.0"?> <configuration> <property> <name>oracle.hadoop.exttab.tableName</name> <value>SH.SALES_EXT_DIR</value> </property> <property> <name>oracle.hadoop.exttab.dataPaths</name> <value>/data/s1/*.csv,/data/s2/*.csv</value> </property> <property> <name>oracle.hadoop.exttab.dataCompressionCodec</name> <value>org.apache.hadoop.io.compress.DefaultCodec</value> </property> <property> <name>oracle.hadoop.connection.url</name> <value>jdbc:oracle:thin:@//myhost:1521/myservicename</value> </property> <property> <name>oracle.hadoop.connection.user</name> <value>SH</value> </property> </configuration>
The following is a complete list of the configuration properties used by the ExternalTable
command-line tool. The properties are organized into these categories:
oracle.hadoop.exttab.colMap.columnLength

Specifies the length of all external table columns of type CHAR
, VARCHAR2
, NCHAR
, NVARCHAR2
, and RAW
. Optional.
Default Value: The maximum length allowed by the column type
For Oracle Database 12c, Oracle SQL Connector for HDFS sets the length of VARCHAR2
, NVARCHAR2
, and RAW
columns depending on whether the database MAX_STRING_SIZE
option is set to STANDARD
or EXTENDED
.
Valid values: Integer
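If you are unsure of the database setting, you can check it with a query such as the following (assuming Oracle Database 12c and access to V$PARAMETER):

SQL> SELECT value FROM v$parameter WHERE name = 'max_string_size';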
oracle.hadoop.exttab.colMap.columnType

Specifies the data type mapping of all columns for Hive and text sources. Optional.
You can override this setting for specific columns by setting oracle.hadoop.exttab.colMap.column_name.columnType
.
Default value: VARCHAR2
for text; see Table 2-1 for Hive
Valid values: The following Oracle data types are supported:
VARCHAR2
NVARCHAR2
CHAR
NCHAR
CLOB
NCLOB
NUMBER
INTEGER
FLOAT
BINARY_DOUBLE
BINARY_FLOAT
RAW*
DATE
TIMESTAMP
TIMESTAMP WITH TIME ZONE
TIMESTAMP WITH LOCAL TIME ZONE
INTERVAL DAY TO SECOND
INTERVAL YEAR TO MONTH
*RAW binary data in delimited text files must be encoded in hexadecimal.

oracle.hadoop.exttab.colMap.dateMask

Specifies the format mask used in the date_format_spec clause of the external table for all DATE columns. This clause indicates that a character data field contains a date in the specified format.
Default value: The default globalization format mask, which is set by the NLS_DATE_FORMAT
database parameter
Valid values: A datetime format model as described in Oracle Database SQL Language Reference. However, it cannot contain quotation marks.
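For example, the following property setting (the mask shown is only illustrative) handles date fields such as 2012-10-17:

<property>
  <name>oracle.hadoop.exttab.colMap.dateMask</name>
  <value>YYYY-MM-DD</value>
</property>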
oracle.hadoop.exttab.colMap.fieldLength

Sets the character buffer length used by the ORACLE_LOADER
access driver for all CLOB
columns. The value is used in the field_list clause of the external table definition, which identifies the fields in the data file and their data types.
Default value: 4000 bytes
Valid values: Integer
oracle.hadoop.exttab.colMap.timestampMask

Specifies the format mask used in the date_format_spec clause of the external table for all TIMESTAMP
and TIMESTAMP WITH LOCAL TIME ZONE
columns. This clause indicates that a character data field contains a timestamp in the specified format.
Default value: The default globalization format mask, which is set by the NLS_TIMESTAMP_FORMAT
database parameter
Valid values: A datetime format model as described in Oracle Database SQL Language Reference. However, it cannot contain quotation marks.
oracle.hadoop.exttab.colMap.timestampTZMask

Specifies the format mask used in the date_format_spec clause of the external table for all TIMESTAMP WITH TIME ZONE
columns. This clause indicates that a character data field contains a timestamp in the specified format.
Default value: The default globalization format mask, which is set by the NLS_TIMESTAMP_TZ_FORMAT
database parameter
Valid values: A datetime format model as described in the Oracle Database SQL Language Reference. However, it cannot contain quotation marks.
oracle.hadoop.exttab.colMap.column_name.columnLength

Specifies the length of all external table columns of type CHAR
, VARCHAR2
, NCHAR
, NVARCHAR2
, and RAW
. Optional.
Default Value: The value of oracle.hadoop.exttab.colMap.columnLength
; if that property is not set, then the maximum length allowed by the data type
Valid values: Integer
oracle.hadoop.exttab.colMap.column_name.columnType

Overrides the data type mapping for column_name. Optional.
The column_name is case-sensitive. It must exactly match the name of a column in a Hive table or a column listed in oracle.hadoop.exttab.columnNames
.
Default value: The value of oracle.hadoop.exttab.colMap.columnType
; if that property is not set, then the default data type identified in Table 2-1
Valid values: See oracle.hadoop.exttab.colMap.columnType
oracle.hadoop.exttab.colMap.column_name.dateMask

Overrides the format mask for column_name. Optional.
The column_name is case-sensitive. It must exactly match the name of a column in a Hive table or a column listed in oracle.hadoop.exttab.columnNames
.
Default value: The value of oracle.hadoop.exttab.colMap.dateMask.
Valid values: A datetime format model as described in the Oracle Database SQL Language Reference. However, it cannot contain quotation marks.
oracle.hadoop.exttab.colMap.column_name.fieldLength

Overrides the character buffer length used by the ORACLE_LOADER
access driver for column_name. This property is especially useful for CLOB
and extended data type columns. Optional.
The column_name is case-sensitive. It must exactly match the name of a column in a Hive table or a column listed in oracle.hadoop.exttab.columnNames
.
Default value: Oracle SQL Connector for HDFS sets the default field lengths as shown in Table 2-4.
Table 2-4 Field Length Calculations
Data Type of Target Column | Field Length |
---|---|
|
Value of |
|
2 * |
|
|
All other types |
255 (default size for external tables) |
Valid values: Integer
oracle.hadoop.exttab.colMap.column_name.timestampMask

Overrides the format mask for column_name. Optional.
The column_name is case-sensitive. It must exactly match the name of a column in a Hive table or a column listed in oracle.hadoop.exttab.columnNames
.
Default value: The value of oracle.hadoop.exttab.colMap.timestampMask
.
Valid values: A datetime format model as described in the Oracle Database SQL Language Reference. However, it cannot contain quotation marks.
oracle.hadoop.exttab.colMap.column_name.timestampTZMask

Overrides the format mask for column_name. Optional.
The column_name is case-sensitive. It must exactly match the name of a column in a Hive table or a column listed in oracle.hadoop.exttab.columnNames
.
Default value: The value of oracle.hadoop.exttab.colMap.timestampTZMask
.
Valid values: A datetime format model as described in Oracle Database SQL Language Reference. However, it cannot contain quotation marks.
oracle.hadoop.exttab.columnCount

The number of columns for the external table created from delimited text files. The column names are set to C1, C2,... Cn, where n is the value of this property.
This property is ignored if oracle.hadoop.exttab.columnNames
is set.
The -createTable
command uses this property when oracle.hadoop.exttab.sourceType=text
.
You must set either this property or oracle.hadoop.exttab.columnNames
when creating an external table from delimited text files.
oracle.hadoop.exttab.columnNames

A comma-separated list of column names for an external table created from delimited text files. If this property is not set, then the column names are set to C1, C2,... Cn, where n is the value of the oracle.hadoop.exttab.columnCount
property.
The column names are read as SQL identifiers: unquoted values are capitalized, and double-quoted values stay exactly as entered.
The -createTable
command uses this property when oracle.hadoop.exttab.sourceType=text
.
You must set either this property or oracle.hadoop.exttab.columnCount
when creating an external table from delimited text files.
oracle.hadoop.exttab.dataCompressionCodec

The name of the compression codec class used to decompress the data files. Specify this property when the data files are compressed. Optional.
This property specifies the class name of any compression codec that implements the org.apache.hadoop.io.compress.CompressionCodec
interface. This codec applies to all data files.
Several standard codecs are available in Hadoop, including the following:
To use codecs that may not be available on your Hadoop cluster (such as Snappy), you must first download, install, and configure them individually on your system.
Default value: None
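For example, if the data files were compressed with gzip, you might set the property as follows:

<property>
  <name>oracle.hadoop.exttab.dataCompressionCodec</name>
  <value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>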
oracle.hadoop.exttab.dataPaths

A comma-separated list of fully qualified HDFS paths. This property enables you to restrict the input by using special pattern-matching characters in the path specification. See Table 2-5. This property is required for the -createTable
and -publish
commands using Data Pump or delimited text files. The property is ignored for Hive data sources.
For example, to select all files in /data/s2/
, and only the CSV files in /data/s7/
, /data/s8/
, and /data/s9/
, enter this expression:
/data/s2/,/data/s[7-9]/*.csv
The external table accesses the data contained in all listed files and all files in listed directories. These files compose a single data set.
The data set can contain compressed files or uncompressed files, but not both.
Table 2-5 Pattern-Matching Characters
Character | Description |
---|---|
? |
Matches any single character |
* |
Matches zero or more characters |
[ |
Matches a single character from the character set {a, b, c} |
[ |
Matches a single character from the character range {a...b}. The character a must be less than or equal to b. |
[^ |
Matches a single character that is not from character set or range {a}. The caret (^) must immediately follow the left bracket. |
\ |
Removes any special meaning of character c. The backslash is the escape character. |
{ |
Matches a string from the string set {ab, cd}. Precede the comma with an escape character (\) to remove the meaning of the comma as a path separator. |
{ |
Matches a string from the string set {ab, cde, cfh}. Precede the comma with an escape character (\) to remove the meaning of the comma as a path separator. |
The path filter class. This property is ignored for Hive data sources.
Oracle SQL Connector for HDFS uses a default filter to exclude hidden files, which begin with a dot or an underscore. If you specify another path filter class using this property, then your filter acts in addition to the default filter. Thus, only visible files accepted by your filter are considered.
Specifies the default directory for the Oracle external table. This directory is used for all input and output files that do not explicitly name a directory object.
Valid value: The name of an existing database directory
Unquoted names are changed to uppercase. Double-quoted names are not changed; use them when case sensitivity is desired. Single-quoted names are not allowed for default directory names.
The -createTable command requires this property.
Specifies the field terminator for an external table when oracle.hadoop.exttab.sourceType=text. Optional.
Default value: , (comma)
Valid values: A string in one of the following formats:
One or more regular printable characters; it cannot start with \u. For example, \t represents a tab.
One or more encoded characters in the format \uHHHH, where HHHH is a big-endian hexadecimal representation of the character in UTF-16. For example, \u0009 represents a tab. The hexadecimal digits are case insensitive.
Do not mix the two formats.
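For example, a tab-separated source could express the terminator in the encoded form. The property name oracle.hadoop.exttab.fieldTerminator is assumed here, since the property heading is not reproduced above.

```xml
<!-- Sketch (assumed property name): tab-separated fields, written as \u0009. -->
<property>
  <name>oracle.hadoop.exttab.fieldTerminator</name>
  <value>\u0009</value>
</property>
```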
Maps a Hive data type to an Oracle data type. The property name identifies the Hive data type, and its value is an Oracle data type. The target columns in the external table are created with the Oracle data type indicated by this property.
When a Hive TIMESTAMP column is mapped to an Oracle TIMESTAMP column, the format mask is YYYY-MM-DD HH24:MI:SS.FF.
When a Hive STRING column is mapped to an Oracle TIMESTAMP column, the NLS parameter settings for the database are used by default. You can override these defaults by using either the oracle.hadoop.exttab.colMap.timestampMask or oracle.hadoop.exttab.colMap.timestampTZMask property.
Default values: Table 2-6 lists the Hive column type properties and their default values.
Valid values: See the valid values for oracle.hadoop.exttab.colMap.columnType.
Table 2-6 Hive Column Type Mapping Properties
Property | Default Value
---|---
oracle.hadoop.exttab.hive.columnType.BIGINT | INTEGER
oracle.hadoop.exttab.hive.columnType.BOOLEAN | VARCHAR2
oracle.hadoop.exttab.hive.columnType.DECIMAL | NUMBER
oracle.hadoop.exttab.hive.columnType.DOUBLE | NUMBER
oracle.hadoop.exttab.hive.columnType.FLOAT | NUMBER
oracle.hadoop.exttab.hive.columnType.INT | INTEGER
oracle.hadoop.exttab.hive.columnType.SMALLINT | INTEGER
oracle.hadoop.exttab.hive.columnType.STRING | VARCHAR2
oracle.hadoop.exttab.hive.columnType.TIMESTAMP | TIMESTAMP
oracle.hadoop.exttab.hive.columnType.TINYINT | INTEGER
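A sketch of overriding one of these defaults, assuming the property names follow the oracle.hadoop.exttab.hive.columnType.HIVE_TYPE pattern shown in Table 2-6; here Hive STRING columns are mapped to CLOB instead of the default.

```xml
<!-- Sketch: map all Hive STRING columns to CLOB rather than the default mapping. -->
<property>
  <name>oracle.hadoop.exttab.hive.columnType.STRING</name>
  <value>CLOB</value>
</property>
```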
The name of a Hive database that contains the input data table.
The -createTable command requires this property when oracle.hadoop.exttab.sourceType=hive.
A valid HiveQL expression that is used to filter the source Hive table partitions. This property is ignored if the table is not partitioned. For additional information, see "oracle.hadoop.loader.input.hive.partitionFilter". The two properties are identical.
Default value: None. All partitions of the Hive table are mapped to external tables.
Valid values: A valid HiveQL expression. Hive user-defined functions (UDFs) and Hive variables are not supported.
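A sketch of a partition filter, assuming the property name oracle.hadoop.exttab.hive.partitionFilter (the counterpart of the loader property cited above) and a hypothetical table partitioned by country and year.

```xml
<!-- Sketch (assumed property name): keep only the US partitions from 2014 onward.
     country and year are hypothetical partition columns. -->
<property>
  <name>oracle.hadoop.exttab.hive.partitionFilter</name>
  <value>country='US' AND year >= 2014</value>
</property>
```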
The name of an existing Hive table.
The -createTable command requires this property when oracle.hadoop.exttab.sourceType=hive.
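Putting the Hive source properties together, a minimal sketch follows. The database and table names are hypothetical, and the property names for the Hive database and table are assumed to be oracle.hadoop.exttab.hive.databaseName and oracle.hadoop.exttab.hive.tableName because their headings are not reproduced above.

```xml
<!-- Sketch: describe a hypothetical Hive source table analytics.web_logs. -->
<property>
  <name>oracle.hadoop.exttab.sourceType</name>
  <value>hive</value>
</property>
<property>
  <name>oracle.hadoop.exttab.hive.databaseName</name>
  <value>analytics</value>
</property>
<property>
  <name>oracle.hadoop.exttab.hive.tableName</name>
  <value>web_logs</value>
</property>
```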
Specifies the initial field encloser for an external table created from delimited text files. Optional.
Default value: null; no enclosers are specified for the external table definition.
The -createTable command uses this property when oracle.hadoop.exttab.sourceType=text.
Valid values: A string in one of the following formats:
One or more regular printable characters; it cannot start with \u.
One or more encoded characters in the format \uHHHH, where HHHH is a big-endian hexadecimal representation of the character in UTF-16. The hexadecimal digits are case insensitive.
Do not mix the two formats.
Specifies the desired number of location files for the external table. Applicable only to non-Data-Pump files.
Default value: 4
This property is ignored if the data files are in Data Pump format. Otherwise, the number of location files is the lesser of:
The number of data files
The value of this property
At least one location file is created.
See "Enabling Parallel Processing" for more information about the number of location files.
Specifies a database directory where log files, bad files, and discard files are stored. The file names are the default values used by external tables. For example, the name of a log file is the table name followed by _%p.log.
This is an optional property for the -createTable command.
These are the default file name extensions:
Log files: log
Bad files: bad
Discard files: dsc
Valid values: An existing Oracle directory object name.
Unquoted names are changed to uppercase. Quoted names are not changed. Table 2-7 provides examples of how values are transformed.
Specifies the database directory for the preprocessor. The file-system directory must contain the hdfs_stream script.
Default value: OSCH_BIN_PATH
The preprocessor directory is used in the PREPROCESSOR clause of the external table.
Specifies the record delimiter for an external table created from delimited text files. Optional.
Default value: \n
The -createTable command uses this property when oracle.hadoop.exttab.sourceType=text.
Valid values: A string in one of the following formats:
One or more regular printable characters; it cannot start with \u.
One or more encoded characters in the format \uHHHH, where HHHH is a big-endian hexadecimal representation of the character in UTF-16. The hexadecimal digits are case insensitive.
Do not mix the two formats.
Specifies the type of source files. The -createTable and -publish operations require the value of this property.
Default value: text
Valid values: datapump, hive, or text
Indicates whether the lengths specified for character strings are bytes or characters. This value is used in the STRING SIZES ARE IN clause of the external table. Use CHARACTERS when loading multibyte character sets. See Oracle Database Utilities.
Default value: BYTES
Valid values: BYTES or CHARACTERS
Schema-qualified name of the external table, or the metadata table for partitioned Hive tables, in this format:
schemaName.tableName
If you omit schemaName, then the schema name defaults to the connection user name.
Default value: none
Required property for all operations.
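For example, assuming the property name oracle.hadoop.exttab.tableName and hypothetical schema and table names:

```xml
<!-- Sketch (assumed property name): create the external table SALES_DP_XTAB in the HR schema. -->
<property>
  <name>oracle.hadoop.exttab.tableName</name>
  <value>HR.SALES_DP_XTAB</value>
</property>
```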
Specifies the trailing field encloser for an external table created from delimited text files. Optional.
Default value: null; defaults to the value of oracle.hadoop.exttab.initialFieldEncloser
The -createTable command uses this property when oracle.hadoop.exttab.sourceType=text.
Valid values: A string in one of the following formats:
One or more regular printable characters; it cannot start with \u.
One or more encoded characters in the format \uHHHH, where HHHH is a big-endian hexadecimal representation of the character in UTF-16. The hexadecimal digits are case insensitive.
Do not mix the two formats.
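As a sketch, a source whose fields are wrapped in double quotation marks could set both enclosers to \u0022. The initial encloser property is named above; oracle.hadoop.exttab.trailingFieldEncloser is assumed for the trailing one, and because the trailing encloser defaults to the initial encloser, the second property is optional here.

```xml
<!-- Sketch: fields enclosed in double quotation marks ("), written as \u0022. -->
<property>
  <name>oracle.hadoop.exttab.initialFieldEncloser</name>
  <value>\u0022</value>
</property>
<!-- Optional: the trailing encloser defaults to the initial encloser (assumed property name). -->
<property>
  <name>oracle.hadoop.exttab.trailingFieldEncloser</name>
  <value>\u0022</value>
</property>
```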
Specifies the database connection string in the thin-style service name format:
jdbc:oracle:thin:@//host_name:port/service_name
If you are unsure of the service name, then enter this SQL command as a privileged user:
SQL> show parameter service
If an Oracle wallet is configured as an external password store, then the property value must start with the driver prefix jdbc:oracle:thin:@ and db_connect_string must exactly match the credentials defined in the wallet.
This property takes precedence over all other connection properties.
Default value: Not defined
Valid values: A string
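A sketch with hypothetical host and service names, assuming the property name oracle.hadoop.connection.url:

```xml
<!-- Sketch (assumed property name): thin-style connection string;
     dbhost.example.com and orcl.example.com are hypothetical. -->
<property>
  <name>oracle.hadoop.connection.url</name>
  <value>jdbc:oracle:thin:@//dbhost.example.com:1521/orcl.example.com</value>
</property>
```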
An Oracle database log-in name. The ExternalTable tool prompts for a password. This property is required unless you are using Oracle Wallet as an external password store.
Default value: Not defined
Valid values: A string
Specifies a TNS entry name defined in the tnsnames.ora file.
This property is used with the oracle.hadoop.connection.tns_admin property.
Default value: Not defined
Valid values: A string
Specifies the directory that contains the tnsnames.ora file. Define this property to use transparent network substrate (TNS) entry names in database connection strings. When using TNSNames with the JDBC thin driver, you must set either this property or the Java oracle.net.tns_admin property. When both are set, this property takes precedence over oracle.net.tns_admin.
This property must be set when using Oracle Wallet as an external password store. See oracle.hadoop.connection.wallet_location.
Default value: The value of the Java oracle.net.tns_admin system property
Valid values: A string
A file path to an Oracle wallet directory where the connection credential is stored.
Default value: Not defined
Valid values: A string
When using Oracle Wallet as an external password store, set these properties:
oracle.hadoop.connection.wallet_location
oracle.hadoop.connection.tns_admin
Either the database connection URL property or the TNS entry name property described above
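A sketch of a wallet-based configuration; the wallet and tnsnames.ora paths and the TNS entry name are hypothetical, and the TNS entry property name is assumed to be oracle.hadoop.connection.tnsEntryName.

```xml
<!-- Sketch: connect through an Oracle Wallet external password store.
     Paths and the TNS entry name are hypothetical. -->
<property>
  <name>oracle.hadoop.connection.wallet_location</name>
  <value>/u01/app/oracle/wallet</value>
</property>
<property>
  <name>oracle.hadoop.connection.tns_admin</name>
  <value>/u01/app/oracle/network/admin</value>
</property>
<property>
  <name>oracle.hadoop.connection.tnsEntryName</name>
  <value>ORCL</value>
</property>
```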
Parallel processing is extremely important when you are working with large volumes of data. When you use external tables, always enable parallel query with this SQL command:
ALTER SESSION ENABLE PARALLEL QUERY;
Before loading the data into an Oracle database from the external files created by Oracle SQL Connector for HDFS, enable parallel DDL:
ALTER SESSION ENABLE PARALLEL DDL;
Before inserting data into an existing database table, enable parallel DML with this SQL command:
ALTER SESSION ENABLE PARALLEL DML;
Hints such as APPEND and PQ_DISTRIBUTE also improve performance when you are inserting data.