This chapter describes how to use Oracle SQL Connector for Hadoop Distributed File System (HDFS) to facilitate data access between Hadoop and Oracle Database.
This chapter contains the following sections:
Using Oracle SQL Connector for HDFS, you can use Oracle Database to access and analyze data residing in Apache Hadoop in these formats:
Data Pump files in HDFS
Delimited text files in HDFS
Delimited text files in Apache Hive tables
For other file formats, such as JSON files, you can stage the input as delimited text in a new Hive table and then use Oracle SQL Connector for HDFS. Partitioned Hive tables are supported, enabling you to represent a subset of Hive table partitions in Oracle Database, instead of the entire Hive table.
Oracle SQL Connector for HDFS uses external tables and database views to provide Oracle Database with read access to Hive tables, and to delimited text files and Data Pump files in HDFS. An external table is an Oracle Database object that identifies the location of data outside of a database. Oracle Database accesses the data by using the metadata provided when the external table was created. Oracle SQL Connector for HDFS creates database views over external tables to support access to partitioned Hive tables. By querying the external tables or views, you can access data stored in HDFS and Hive tables as if that data were stored in tables in an Oracle database.
To create these objects in Oracle Database, you use the ExternalTable
command-line tool provided with Oracle SQL Connector for HDFS. You provide ExternalTable
with information about the data source in Hadoop and about your schema in an Oracle Database. You provide this information either as options to the ExternalTable
command or in an XML file. The ExternalTable command-line tool can be used from either a shell or from SQL Developer.
When the external table is ready, you can query the data the same as any other database table. You can query and join data in HDFS or a Hive table with other database-resident data.
You can also perform bulk loads of data into Oracle database tables using SQL.You may prefer that the data resides in an Oracle database (either all of it or just a selection) if it is queried routinely. Oracle SQL Connector for HDFS functions as a Hadoop client running on the Oracle database and uses the external table preprocessor hdfs_stream
to access data in HDFS. Oracle Shell for Hadoop Loaders has commands to create and external table and do a bulk load in one step.
See Also:
The following Oracle blog provides information on using Oracle SQL Developer: https://blogs.oracle.com/bigdataconnectors/move-data-between-apache-hadoop-and-oracle-database-with-sql-developerThe following list identifies the basic steps that you take when using Oracle SQL Connector for HDFS.
Example 2-1 Accessing HDFS Data Files from Oracle Database
The following illustrates these steps:
$ cat moviefact_hdfs.sh # Add environment variables export OSCH_HOME="/u01/connectors/orahdfs-<version>" hadoop jar $OSCH_HOME/jlib/orahdfs.jar \ oracle.hadoop.exttab.ExternalTable \ -conf /home/oracle/movies/moviefact_hdfs.xml \ -createTable $ cat moviefact_hdfs.xml <?xml version="1.0"?> <configuration> <property> <name>oracle.hadoop.exttab.tableName</name> <value>MOVIE_FACTS_EXT</value> </property> <property> <name>oracle.hadoop.exttab.locationFileCount</name> <value>4</value> </property> <property> <name>oracle.hadoop.exttab.dataPaths</name> <value>/user/oracle/moviework/data/part*</value> </property> <property> <name>oracle.hadoop.exttab.fieldTerminator</name> <value>\u0009</value> </property> <property> <name>oracle.hadoop.exttab.defaultDirectory</name> <value>MOVIEDEMO_DIR</value> </property> <property> <name>oracle.hadoop.exttab.nullIfSpecifier</name> <value>\N</</value> </property> <property> <name>oracle.hadoop.exttab.columnNames</name> <value>CUST_ID,MOVIE_ID,GENRE_ID,TIME_ID,RECOMMENDED,ACTIVITY_ID,RATING,SALES</value> </property> <property> <name>oracle.hadoop.exttab.colMap.TIME_ID.columnType</name> <value>TIMESTAMP</value> </property> <property> <name>oracle.hadoop.exttab.colMap.timestampMask</name> <value>YYYY-MM-DD:HH:MI:SS</value> </property> <property> <name>oracle.hadoop.exttab.colMap.RECOMMENDED.columnType</name> <value>NUMBER</value> </property> <property> <name>oracle.hadoop.exttab.colMap.ACTIVITY_ID.columnType</name> <value>NUMBER</value> </property> <property> <name>oracle.hadoop.exttab.colMap.RATING.columnType</name> <value>NUMBER</value> </property> <property> <name>oracle.hadoop.exttab.colMap.SALES.columnType</name> <value>NUMBER</value> </property> <property> <name>oracle.hadoop.exttab.sourceType</name> <value>text</value> </property> <property> <name>oracle.hadoop.connection.url</name> <value>jdbc:oracle:thin:@localhost:1521:orcl</value> </property> <property> <name>oracle.hadoop.connection.user</name> <value>MOVIEDEMO</value> </property> </configuration> $ sh moviefact_hdfs.sh Oracle SQL Connector for HDFS Release 3.4.0 - Production Copyright (c) 2011, 2015, Oracle and/or its affiliates. All rights reserved. [Enter Database Password: password] The create table command succeeded. CREATE TABLE "MOVIEDEMO"."MOVIE_FACTS_EXT" ( "CUST_ID" VARCHAR2(4000), "MOVIE_ID" VARCHAR2(4000), "GENRE_ID" VARCHAR2(4000), "TIME_ID" TIMESTAMP(9), "RECOMMENDED" NUMBER, "ACTIVITY_ID" NUMBER, "RATING" NUMBER, "SALES" NUMBER ) ORGANIZATION EXTERNAL ( TYPE ORACLE_LOADER DEFAULT DIRECTORY "MOVIEDEMO_DIR" ACCESS PARAMETERS ( RECORDS DELIMITED BY 0X'0A' CHARACTERSET AL32UTF8 PREPROCESSOR "OSCH_BIN_PATH":'hdfs_stream' FIELDS TERMINATED BY 0X'09' MISSING FIELD VALUES ARE NULL ( "CUST_ID" CHAR(4000), "MOVIE_ID" CHAR(4000), "GENRE_ID" CHAR(4000), "TIME_ID" CHAR, "RECOMMENDED" CHAR, "ACTIVITY_ID" CHAR, "RATING" CHAR, "SALES" CHAR ) ) LOCATION ( 'osch-20141114064206-5250-1', 'osch-20141114064206-5250-2', 'osch-20141114064206-5250-3', 'osch-20141114064206-5250-4' ) ) PARALLEL REJECT LIMIT UNLIMITED; The following location files were created. osch-20141114064206-5250-1 contains 1 URI, 12754882 bytes 12754882 hdfs://localhost.localdomain:8020/user/oracle/moviework/data/part-00001 osch-20141114064206-5250-2 contains 1 URI, 438 bytes 438 hdfs://localhost.localdomain:8020/user/oracle/moviework/data/part-00002 osch-20141114064206-5250-3 contains 1 URI, 432 bytes 432 hdfs://localhost.localdomain:8020/user/oracle/moviework/data/part-00003 osch-20141114064206-5250-4 contains 1 URI, 202 bytes 202 hdfs://localhost.localdomain:8020/user/oracle/moviework/data/part-00004 $ sqlplus moviedemo SQL*Plus: Release 12.1.0.1.0 Production on Fri Apr 18 09:24:18 2014 Copyright (c) 1982, 2013, Oracle. All rights reserved. Enter password: password Last Successful login time: Thu Apr 17 2014 18:42:01 -05:00 Connected to: Oracle Database 12c Enterprise Edition Release 12.1.0.1.0 - 64bit Production With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options SQL> DESCRIBE movie_facts_ext; Name Null? Type ----------------------------------------- -------- ---------------------------- CUST_ID VARCHAR2(4000) MOVIE_ID VARCHAR2(4000) GENRE_ID VARCHAR2(4000) TIME_ID TIMESTAMP(9) RECOMMENDED NUMBER ACTIVITY_ID NUMBER RATING NUMBER SALES NUMBER SQL> CREATE TABLE movie_facts AS SELECT * FROM movie_facts_ext; Table created. SQL> SELECT movie_id, time_id, recommended, rating FROM movie_facts WHERE rownum < 5; MOVIE_ID TIME_ID RECOMMENDED RATING -------- -------------------------------- ----------- ---------- 205 03-DEC-10 03.14.54.000000000 AM 1 1 77 14-AUG-11 10.46.55.000000000 AM 1 3 116 24-NOV-11 05.43.00.000000000 AM 1 5 141 01-JAN-11 05.17.57.000000000 AM 1 4
You can run the ExternalTable
command-line tool provided with Oracle SQL Connector for HDFS on either the Oracle Database system or the Hadoop cluster:
For Hive sources, log in to either a node in the Hadoop cluster or a system set up as a Hadoop client for the cluster.
For text and Data Pump format files, log in to either the Oracle Database system or a node in the Hadoop cluster.
Oracle SQL Connector for HDFS requires additions to the HADOOP_CLASSPATH
environment variable on the system where you log in to run the tool. Your system administrator may have set them up for you when creating your account, or may have left that task for you. See "OS-Level Requirements for OSCH Users".
Setting up the environment variables:
Verify that HADOOP_CLASSPATH
includes the path to the JAR files for Oracle SQL Connector for HDFS:
path/orahdfs-<version>/jlib/*
If you are logged in to a Hadoop cluster with Hive data sources, then verify that HADOOP_CLASSPATH
also includes the Hive JAR files and conf directory. For example:
/usr/lib/hive/lib/* /etc/hive/conf
For your convenience, you can create an OSCH_HOME
environment variable. The following is the Bash command for setting it on Oracle Big Data Appliance:
$ export OSCH_HOME="/opt/oracle/orahdfs-<version>"
See Also:
"Oracle SQL Connector for Hadoop Distributed File System Setup" for instructions for installing the software and setting up user accounts on both systems.
OSCH_HOME
/doc/README.txt
for information about known problems with Oracle SQL Connector for HDFS.
Oracle SQL Connector for HDFS is a command-line utility that accepts generic command line arguments supported by the org.apache.hadoop.util.Tool interface. It also provides a preprocessor for Oracle external tables. See the Oracle Big Data Appliance Software User's Guide for instructions on configuring Oracle Exadata Database Machine for Use with Oracle Big Data Appliance.
Oracle SQL Connector for HDFS provides a command-line tool named ExternalTable
. This section describes the basic use of this tool. See "Creating External Tables" for the command syntax that is specific to your data source format.
The ExternalTable
tool uses the values of several properties to do the following tasks:
Create an external table
Populate the location files
Publish location files to an existing external table
List the location files
Describe an external table
You can specify these property values in an XML document or individually on the command line. .
Related Topics
This is the full syntax of the ExternalTable
command-line tool, which is run using the hadoop
command:
hadoop jar OSCH_HOME/jlib/orahdfs.jar \ oracle.hadoop.exttab.ExternalTable \ [-conf config_file]... \ [-D property=value]... \ -createTable [--noexecute [--output filename.sql]] | -drop [--noexecute] | -describe | -publish [--noexecute] | -listlocations [--details] | -getDDL
You can either create the OSCH_HOME
environment variable or replace OSCH_HOME
in the command syntax with the full path to the installation directory for Oracle SQL Connector for HDFS. On Oracle Big Data Appliance, this directory is:
/opt/oracle/orahdfs-<version>
For example, you might run the ExternalTable
command-line tool with a command like this:
hadoop jar /opt/oracle/orahdfs-<version>/jlib/orahdfs.jar \ oracle.hadoop.exttab.ExternalTable \ . . .
Generic Options and User Commands
Identifies the name of an XML configuration file containing properties needed by the command being executed.
Assigns a value to a specific property.
Creates an external table definition and publishes the data URIs to the location files of the external table. The output report shows the DDL used to create the external table and lists the contents of the location files. Oracle SQL Connector for HDFS also checks the database to ensure that the required database directories exist and that you have the necessary permissions.
For partitioned Hive tables, Oracle SQL Connector for HDFS creates external tables, views, and a metadata table. See Table 2-2.
Specify the metadata table name for partitioned Hive tables, or the external table name for all other data sources.
Use the --noexecute
option to see the execution plan of the command. The operation is not executed, but the report includes the details of the execution plan and any errors. The --output
option writes the table DDL from the -createTable
command to a file. Oracle recommends that you first execute a -createTable
command with --noexecute
.
Deletes one or more Oracle Database objects created by Oracle SQL Connector for HDFS to support a particular data source. Specify the metadata table name for partitioned Hive tables, or the external table name for all other data sources. An error occurs if you attempt to drop a table or view that Oracle SQL Connector for HDFS did not create.
Use the --noexecute
option to list the objects to be deleted.
Provides information about the Oracle Database objects created by Oracle SQL Connector for HDFS. Use this command instead of -getDDL
or -listLocations
.
Publishes the data URIs to the location files of an existing external table. Use this command after adding new data files, so that the existing external table can access them.
Use the --noexecute
option to see the execution plan of the command. The operation is not executed, but the report shows the planned SQL ALTER TABLE
command and location files. The report also shows any errors.
Oracle recommends that you first execute a -publish
command with --noexecute
.
See "Updating External Tables."
Shows the location file content as text. With the --details
option, this command provides a detailed listing. This command is deprecated in release 3.0. Use “-describe” instead.
Prints the table definition of an existing external table. This command is deprecated in release 3.0. Use “-describe” instead.
Related Topics
See Also:
You can create external tables automatically using the ExternalTable
tool provided in Oracle SQL Connector for HDFS.
To create an external table using the ExternalTable
tool, follow the instructions for your data source:
When the ExternalTable -createTable
command finishes executing, the external table is ready for use. ExternalTable
also manages the location files for the external table. See "Location File Management."
To create external tables manually, follow the instructions in "Creating External Tables in SQL."
ExternalTable Syntax for -createTable
Use the following syntax to create an external table and populate its location files:
hadoop jar OSCH_HOME/jlib/orahdfs.jar oracle.hadoop.exttab.ExternalTable \ [-conf config_file]... \ [-D property=value]... \ -createTable [--noexecute]
Oracle SQL Connector for HDFS supports only Data Pump files produced by Oracle Loader for Hadoop, and does not support generic Data Pump files produced by Oracle Utilities.
Oracle SQL Connector for HDFS creates the external table definition for Data Pump files by using the metadata from the Data Pump file header. It uses the ORACLE_LOADER
access driver with the preprocessor
access parameter. It also uses a special access parameter named EXTERNAL VARIABLE DATA
, which enables ORACLE_LOADER
to read the Data Pump format files generated by Oracle Loader for Hadoop.
To delete the external tables and location files created by Oracle SQL Connector for HDFS, use the -drop
command. See "Dropping Database Objects Created by Oracle SQL Connector for HDFS".
Note:
Oracle SQL Connector for HDFS requires a patch to Oracle Database 11.2.0.2 before the connector can access Data Pump files produced by Oracle Loader for Hadoop. To download this patch, go to http://support.oracle.com
and search for bug 14557588.
Release 11.2.0.3 and later releases do not require this patch.
These properties are required:
oracle.hadoop.exttab.sourceType=datapump
See "Configuring Oracle SQL Connector for HDFS" for descriptions of the properties used for this data source.
The following example is an XML template containing the properties that describe a Data Pump file. To use the template, cut and paste it into a text file, enter the appropriate values to describe your Data Pump file, and delete any optional properties that you do not need. For more information about using XML templates, see "Creating a Configuration File."
Example 2-2 XML Template with Properties for a Data Pump Format File
<?xml version="1.0"?> <!-- Required Properties --> <configuration> <property> <name>oracle.hadoop.exttab.tableName</name> <value>value</value> </property> <property> <name>oracle.hadoop.exttab.defaultDirectory</name> <value>value</value> </property> <property> <name>oracle.hadoop.exttab.dataPaths</name> <value>value</value> </property> <property> <name>oracle.hadoop.exttab.sourceType</name> <value>datapump</value> </property> <property> <name>oracle.hadoop.connection.url</name> <value>value</value> </property> <property> <name>oracle.hadoop.connection.user</name> <value>value</value> </property> <!-- Optional Properties --> <property> <name>oracle.hadoop.exttab.logDirectory</name> <value>value</value> </property> </configuration>
The following example creates an external table named SALES_DP_XTAB
to read Data Pump files.
Example 2-3 Defining an External Table for Data Pump Format Files
Log in as the operating system user that Oracle Database runs under (typically the oracle
user), and create a file-system directory. For Oracle RAC, you must create a clusterwide directory on a distributed file system.
$ mkdir /data/sales_dp_dir
Create a database directory and grant read and write access to it:
$ sqlplus / as sysdba SQL> CREATE OR REPLACE DIRECTORY sales_dp_dir AS '/data/sales_dp_dir' SQL> GRANT READ, WRITE ON DIRECTORY sales_dp_dir TO scott;
Create the external table:
$ export OSCH_HOME="/opt/oracle/orahdfs-<version>"
$ export HADOOP_CLASSPATH="$OSCH_HOME/jlib/*:$HADOOP_CLASSPATH"
$ hadoop jar OSCH_HOME/jlib/orahdfs.jar \
oracle.hadoop.exttab.ExternalTable \
-D oracle.hadoop.exttab.tableName=SALES_DP_XTAB \
-D oracle.hadoop.exttab.sourceType=datapump \
-D oracle.hadoop.exttab.dataPaths=hdfs:///user/scott/olh_sales_dpoutput/ \
-D oracle.hadoop.exttab.defaultDirectory=SALES_DP_DIR \
-D oracle.hadoop.connection.url=jdbc:oracle:thin:@//myhost:1521/myservicename \
-D oracle.hadoop.connection.user=SCOTT \
-createTable
Oracle SQL Connector for HDFS creates the external table definition from a Hive table by contacting the Hive metastore client to retrieve information about the table columns and the location of the table data. In addition, the Hive table data paths are published to the location files of the Oracle external table.
To read Hive table metadata, Oracle SQL Connector for HDFS requires that the Hive JAR files are included in the HADOOP_CLASSPATH
variable. Oracle SQL Connector for HDFS must be installed and running on a computer with a working Hive client.
Ensure that you add the Hive configuration directory to the HADOOP_CLASSPATH
environment variable. You must have a correctly functioning Hive client.
For Hive managed tables, the data paths come from the warehouse directory.
For Hive external tables, the data paths from an external location in HDFS are published to the location files of the Oracle external table. Hive external tables can have no data, because Hive does not check if the external location is defined when the table is created. If the Hive table is empty, then one location file is published with just a header and no data URIs.
The Oracle external table is not a "live" Hive table. After changes are made to a Hive table, you must use the ExternalTable
tool to drop the existing external table and create a new one.
To delete the external tables and location files created by Oracle SQL Connector for HDFS, use the -drop
command. See "Dropping Database Objects Created by Oracle SQL Connector for HDFS".
Oracle SQL Connector for HDFS supports Hive tables that are defined using ROW FORMAT DELIMITED
and FILE FORMAT TEXTFILE
clauses. Both Hive-managed tables and Hive external tables are supported.
Oracle SQL Connector for HDFS also supports partitioned Hive tables. In this case Oracle SQL Connector for HDFS creates one or more external tables and database views. See "Creating External Tables from Partitioned Hive Tables".
Hive tables can be either bucketed or not bucketed. All primitive types from Hive 0.10.0 are supported.
The following table shows the default data-type mappings between Hive and Oracle. To change the data type of the target columns created in the Oracle external table, set the oracle.hadoop.exttab.hive.columnType.*
properties listed under "Optional Properties.".
Table 2-1 Hive Data Type Mappings
Data Type of Source Hive Column | Default Data Type of Target Oracle Column |
---|---|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
These properties are required for Hive table sources:
See "Configuring Oracle SQL Connector for HDFS" for descriptions of the properties used for this data source.
The following example is an XML template containing the properties that describe a Hive table. To use the template, cut and paste it into a text file, enter the appropriate values to describe your Hive table, and delete any optional properties that you do not need. For more information about using XML templates, see "Creating a Configuration File."
Example 2-4 XML Template with Properties for a Hive Table
<?xml version="1.0"?> <!-- Required Properties --> <configuration> <property> <name>oracle.hadoop.exttab.tableName</name> <value>value</value> </property> <property> <name>oracle.hadoop.exttab.defaultDirectory</name> <value>value</value> </property> <property> <name>oracle.hadoop.exttab.sourceType</name> <value>hive</value> </property> <property> <name>oracle.hadoop.exttab.hive.partitionFilter</name> <value>value</value> </property> <property> <name>oracle.hadoop.exttab.hive.tableName</name> <value>value</value> </property> <property> <name>oracle.hadoop.exttab.hive.databaseName</name> <value>value</value> </property> <property> <name>oracle.hadoop.connection.url</name> <value>value</value> </property> <property> <name>oracle.hadoop.connection.user</name> <value>value</value> </property> <!-- Optional Properties --> <property> <name>oracle.hadoop.exttab.locationFileCount</name> <value>value</value> </property> <property> <name>oracle.hadoop.exttab.hive.columnType.TYPE</name> <value>value</value> </property> </configuration>
This example creates an external table named SALES_HIVE_XTAB
to read data from a Hive table. The example defines all the properties on the command line instead of in an XML file.
Example 2-5 Defining an External Table for a nonpartitioned Hive Table
Log in as the operating system user that Oracle Database runs under (typically the oracle
user), and create a file-system directory:
$ mkdir /data/sales_hive_dir
Create a database directory and grant read and write access to it:
$ sqlplus / as sysdba SQL> CREATE OR REPLACE DIRECTORY sales_hive_dir AS '/data/sales_hive_dir' SQL> GRANT READ, WRITE ON DIRECTORY sales_hive_dir TO scott;
Create the external table:
$ export OSCH_HOME="/opt/oracle/orahdfs-<version>"
$ export HADOOP_CLASSPATH="$OSCH_HOME/jlib/*:/usr/lib/hive/lib/*:/etc/hive/conf:$HADOOP_CLASSPATH"
$ hadoop jar OSCH_HOME/jlib/orahdfs.jar \
oracle.hadoop.exttab.ExternalTable \
-D oracle.hadoop.exttab.tableName=SALES_HIVE_XTAB \
-D oracle.hadoop.exttab.sourceType=hive \
-D oracle.hadoop.exttab.locationFileCount=2 \
-D oracle.hadoop.exttab.hive.tableName=sales_country_us \
-D oracle.hadoop.exttab.hive.databaseName=salesdb \
-D oracle.hadoop.exttab.defaultDirectory=SALES_HIVE_DIR \
-D oracle.hadoop.connection.url=jdbc:oracle:thin:@//myhost:1521/myservicename \
-D oracle.hadoop.connection.user=SCOTT \
-createTable
Note:
For nonpartitioned Hive tables and other data sources the value for property oracle.hadoop.exttab.tableName
is the name of the external table.
Oracle SQL Connector for HDFS supports partitioned Hive tables, enabling you to query a single partition, a range of partitions, or all partitions. You can represent all Hive partitions or a subset of them in Oracle Database.
See Also:
"Creating External Tables from Hive Tables" for required properties, data type mappings, and other details applicable to all Hive table access using Oracle SQL Connector for HDFS.
To support a partitioned Hive table, Oracle SQL Connector for HDFS creates the objects described in the following table.
Table 2-2 Oracle Database Objects for Supporting a Partitioned Hive Table
Database Object | Description | Naming Convention(1) |
---|---|---|
External Tables |
One for each Hive partition |
For example, |
Views |
One for each external table. Used for querying the Hive data. |
For example, |
Metadata Table |
One for the Hive table. Identifies all external tables and views associated with a particular Hive table. Specify this table when creating, describing, or dropping these database objects. |
For example, |
Footnote 1 The "_n" suffixed with table name indicates a numeric value.
For example, if a Hive table comprises five partitions, then Oracle SQL Connector for HDFS creates five external tables, five views, and one metadata table in Oracle Database.
To drop the objects described in Table 2-2 and the location files, use the -drop
command. See "Dropping Database Objects Created by Oracle SQL Connector for HDFS".
Note:
For partitioned Hive tables and other data sources the value for property oracle.hadoop.exttab.tableName
is the name of the metadata table.
The metadata table provides critical information about how to query the Hive table. The following table describes the columns of a metadata table.
Table 2-3 Metadata Table Columns
Column | Description |
---|---|
|
The Oracle Database view used to access a single Hive table partition. The view contains both Hive table and partition columns. |
|
An Oracle external table that represents a Hive table partition. The external table contains only the Hive table columns and not the Hive partition columns. To access all the data in a Hive partition, use the corresponding Oracle Database view. |
|
The partitioned Hive table being accessed through Oracle Database. |
|
The Hive database where the table resides. |
|
The Hive partition filter used to select a subset of partitions for access by Oracle Database. A |
Partition Columns |
Each column used to partition the Hive table has a separate column in the metadata table. For example, the metadata table has columns for |
The following SELECT
statement queries a metadata table named HIVE_SALES_DATA
:
SQL> SELECT view_name, ext_table_name, Hive_table_name, \ hive_db_name, country, city \ FROM hive_sales_data \ WHERE state = 'TEXAS';
The results of the query identify three views with data from cities in Texas:
VIEW_NAME EXT_TABLE_NAME HIVE_TABLE_NAME HIVE_DB_NAME COUNTRY CITY ------------------------------------------------------------------------------------------ HIVE_SALES_DATA_1 OSCHHIVE_SALES_DATA_1 hive_sales_data db_sales US AUSTIN HIVE_SALES_DATA_2 OSCHHIVE_SALES_DATA_2 hive_sales_data db_sales US HOUSTON HIVE_SALES_DATA_3 OSCHHIVE_SALES_DATA_3 hive_sales_data db_sales US DALLAS
The views include partition column values. Oracle recommends that you use the views while querying a partitioned Hive table, as the external tables do not include the partition column values.
To facilitate querying, you can create UNION ALL
views over the individual partition views. Use the mkhive_unionall_view.sql
script, which is provided in the OSCH_HOME/example/sql
directory. To maintain performance, do not create UNION ALL
views over more than 50 to 100 views (depending on their size).
To use mkhive_unionall_view.sql
, use the following syntax:
@mkhive_unionall_view[.sql] table schema view predicate
MKHIVE_UNIONALL_VIEW Script Parameters
The name of the metadata table in Oracle Database that represents a partitioned Hive table. Required.
The owner of the metadata table. Optional; defaults to your schema.
The name of the UNION ALL
view created by the script. Optional; defaults to table_
ua
.
A WHERE
condition used to select the partitions in the Hive table to include in the UNION ALL
view. Optional; defaults to all partitions.
Example 2-6 Union All Views for Partitioned Hive Tables
The following example creates a UNION ALL
view named HIVE_SALES_DATA_UA
, which accesses all partitions listed in the HIVE_SALES_DATA
metadata table:
SQL> @mkhive_unionall_view.sql HIVE_SALES_DATA null null null
This example creates a UNION ALL
view named ALL_SALES
, which accesses all partitions listed in the HIVE_SALES_DATA
metadata table:
SQL> @mkhive_unionall_view.sql HIVE_SALES_DATA null ALL_SALES null
The next example creates a UNION ALL
view named TEXAS_SALES_DATA
, which accesses the rows of all partitions where STATE = 'TEXAS'
.
SQL> @mkhive_unionallview.sql HIVE_SALES_DATA null TEXAS_SALES_DATA '(STATE = ''''TEXAS'''')'
Cause: The names generated for the database objects exceed 30 characters.
Cause: An attempt was made to create external tables for a partitioned Hive table, but the data objects already exist.
Always use Oracle SQL Connector for HDFS commands to manage objects created by the connector to support partitioned Hive tables. Dangling objects are caused when you use the SQL drop table
command to drop a metadata table instead of the -drop
command. If you are unable to drop the external tables and views for a partitioned Hive table, then they are dangling objects.
Notice the schema and table names in the error message generated when you attempted to drop the objects, and use them in the following procedure.
To drop dangling database objects:
Open a SQL session with Oracle Database, and connect as the owner of the dangling objects.
Identify the location files of the external table by querying the ALL_EXTERNAL_LOCATIONS
and ALL_EXTERNAL_TABLES
data dictionary views:
SELECT a.table_name, a.directory_name, a.location \ FROM all_external_locations a, all_external_tables b \ WHERE a.table_name = b.table_name AND a.table_name \ LIKE 'OSCHtable%' AND a.owner='schema';
In the LIKE
clause of the previous syntax, replace table and schema with the appropriate values.
In the output, the location file names have an osch-
prefix, such as osch-20140408014604-175-1
.
Identify the external tables by querying the ALL_EXTERNAL_TABLES
data dictionary view:
SELECT table_name FROM all_external_tables \ WHERE table_name \ LIKE 'OSCHtable%' AND owner=schema;
Identify the database views by querying the ALL_VIEWS
data dictionary view:
SELECT view_name FROM all_views WHERE view_name LIKE 'table%' AND owner='schema';
Inspect the tables, views, and location files to verify that they are not needed, using commands like the following:
DESCRIBE schema.table; SELECT * FROM schema.table; DESCRIBE schema.view; SELECT * FROM schema.view;
Delete the location files, tables, and views that are not needed, using commands like the following:
EXECUTE utl_file.fremove('directory', 'location_file'); DROP TABLE schema.table; DROP VIEW schema.view;
Oracle SQL Connector for HDFS creates the external table definition for delimited text files using configuration properties that specify the number of columns, the text delimiter, and optionally, the external table column names. By default, all text columns in the external table are VARCHAR2
. If column names are not provided, they default to C1 to Cn, where n is the number of columns specified by the oracle.hadoop.exttab.columnCount
property.
All text data sources are automatically mapped to VARCHAR2(4000)
. To change the data type of the target columns created in the Oracle external table, set the oracle.hadoop.exttab.colMap.*
properties listed under "Optional Properties."
These properties are required for delimited text sources:
See "Configuring Oracle SQL Connector for HDFS" for descriptions of the properties used for this data source.
This example is an XML template containing all the properties that describe a delimited text file. To use the template, cut and paste it into a text file, enter the appropriate values to describe your data files, and delete any optional properties that you do not need. For more information about using XML templates, see "Creating a Configuration File."
Example 2-7 XML Template with Properties for a Delimited Text File
<?xml version="1.0"?> <!-- Required Properties --> <configuration> <property> <name>oracle.hadoop.exttab.tableName</name> <value>value</value> </property> <property> <name>oracle.hadoop.exttab.defaultDirectory</name> <value>value</value> </property> <property> <name>oracle.hadoop.exttab.dataPaths</name> <value>value</value> </property> <!-- Use either columnCount or columnNames --> <property> <name>oracle.hadoop.exttab.columnCount</name> <value>value</value> </property> <property> <name>oracle.hadoop.exttab.columnNames</name> <value>value</value> </property> <property> <name>oracle.hadoop.connection.url</name> <value>value</value> </property> <property> <name>oracle.hadoop.connection.user</name> <value>value</value> </property> <!-- Optional Properties --> <property> <name>oracle.hadoop.exttab.colMap.TYPE</name> <value>value</value> </property> <property> <name>oracle.hadoop.exttab.recordDelimiter</name> <value>value</value> </property> <property> <name>oracle.hadoop.exttab.fieldTerminator</name> <value>value</value> </property> <property> <name>oracle.hadoop.exttab.initialFieldEncloser</name> <value>value</value> </property> <property> <name>oracle.hadoop.exttab.trailingFieldEncloser</name> <value>value</value> </property> <property> <name>oracle.hadoop.exttab.locationFileCount</name> <value>value</value> </property> </configuration>
This example creates an external table named SALES_DT_XTAB
from delimited text files.
Example 2-8 Defining an External Table for Delimited Text Files
Log in as the operating system user that Oracle Database runs under (typically the oracle
user), and create a file-system directory:
$ mkdir /data/sales_dt_dir
Create a database directory and grant read and write access to it:
$ sqlplus / as sysdba SQL> CREATE OR REPLACE DIRECTORY sales_dt_dir AS '/data/sales_dt_dir' SQL> GRANT READ, WRITE ON DIRECTORY sales_dt_dir TO scott;
Create the external table:
$ export OSCH_HOME="/opt/oracle/orahdfs-<version>"
$ export HADOOP_CLASSPATH="$OSCH_HOME/jlib/*:$HADOOP_CLASSPATH"
$ hadoop jar OSCH_HOME/jlib/orahdfs.jar \
oracle.hadoop.exttab.ExternalTable \
-D oracle.hadoop.exttab.tableName=SALES_DT_XTAB \
-D oracle.hadoop.exttab.locationFileCount=2 \
-D oracle.hadoop.exttab.dataPaths="hdfs:///user/scott/olh_sales/*.dat" \
-D oracle.hadoop.exttab.columnCount=10 \
-D oracle.hadoop.exttab.defaultDirectory=SALES_DT_DIR \
-D oracle.hadoop.connection.url=jdbc:oracle:thin:@//myhost:1521/myservicename \
-D oracle.hadoop.connection.user=SCOTT \
-createTable
You can create an external table manually for Oracle SQL Connector for HDFS. For example, the following procedure enables you to use external table syntax that is not exposed by the ExternalTable -createTable
command.
Additional syntax might not be supported for Data Pump format files.
To create an external table manually:
Use the -createTable --noexecute
command to generate the external table DDL.
Make whatever changes are needed to the DDL.
Run the DDL from the previous step to create the table definition in the Oracle database.
Use the ExternalTable -publish
command to publish the data URIs to the location files of the external table.
The -publish
option provides a way to update the location files (that point to HDFS data paths) of existing Oracle external tables based on text, Data Pump, or Hive sources.
The ExternalTable
command line tool with the -createTable
option creates the external table and related metadata in Oracle Database. It also populates the external table’s location files with the Universal Resource Identifiers (URIs) of the data files in HDFS.
To “publish” updates to the location files of existing external tables, use ExternalTable
with the -publish
option. This operation updates the location files of external tables with new URIs of HDFS data paths, and adds external tables and views for new partitions when the source is a partitioned Hive table
Use -publish
in order to:
Publish new data into an already existing external table.
-createTable
takes a snapshot of the HDFS or Hive source at the time of external table creation. However the source may change later. The -publish
option enables you to update the existing external table from the source.
Publish new data when the source is a partitioned Hive table.
When new partitions are added to a source that is a partitioned Hive table, the -publish
option enables you to add new database objects required to access the new partitions. This option detects the new Hive partitions and creates the additional external tables and views for the partitions and updates the metadata table created with the -createTable
command.
Populate an external table that you created manually using the -createTable command with the --noexecute
option.
Note:
The publish
option now fully supports partitioned Hive tables. It is no longer necessary to use -drop
and -createTable
as a workaround to update external tables derived from partitioned Hive tables.
See Also:
hadoop jar OSCH_HOME/jlib/orahdfs.jar \ oracle.hadoop.exttab.ExternalTable \ [-conf config_file]... \ [-D property=value]... \ -publish [--noexecute]
See Also:
This example sets HADOOP_CLASSPATH
and publishes the HDFS data paths to the external table created in Example 2-3. See "Configuring Your System for Oracle SQL Connector for HDFS" for more information about setting this environment variable.
Example 2-9 Publishing HDFS Data Paths to an External Table for Data Pump Format Files
This example uses the Bash shell.
$ export HADOOP_CLASSPATH="OSCH_HOME/jlib/*" $ hadoop jar OSCH_HOME/jlib/orahdfs.jar oracle.hadoop.exttab.ExternalTable \ -D oracle.hadoop.exttab.tableName=SALES_DP_XTAB \ -D oracle.hadoop.exttab.sourceType=datapump \ -D oracle.hadoop.exttab.dataPaths=hdfs:/user/scott/data/ \ -D oracle.hadoop.connection.url=jdbc:oracle:thin:@//myhost:1521/myservicename \ -D oracle.hadoop.connection.user=scott -publish
In this example:
OSCH_HOME
is the full path to the Oracle SQL Connector for HDFS installation directory.
SALES_DP_XTAB
is the external table created in Example 2-3.
hdfs:/user/scott/data/
is the location of the HDFS data.
@myhost:1521
is the database connection string.
The -describe
command is a debugging and diagnostic utility that prints the definition of an existing external table. It also enables you to see the location file metadata and contents. You can use this command to verify the integrity of the location files of an Oracle external table.
These properties are required to use this command:
The JDBC connection properties; see "Connections using url, user, and password Properties."
hadoop jar OSCH_HOME/jlib/orahdfs.jar \ oracle.hadoop.exttab.ExternalTable \ [-conf config_file]... \ [-D property=value]... \ -describe
See Also:
This example shows the command syntax to describe the external tables and location files associated with SALES_DP_XTAB
.
Example 2-10 Exploring External Tables and Location Files
$ export HADOOP_CLASSPATH="OSCH_HOME/jlib/*" $ hadoop jar OSCH_HOME/jlib/orahdfs.jar oracle.hadoop.exttab.ExternalTable \ -D oracle.hadoop.exttab.tableName=SALES_DP_XTAB \ -D oracle.hadoop.connection.url=jdbc:oracle:thin:@//myhost:1521/myservicename \ -D oracle.hadoop.connection.user=scott -describe
The -drop
command deletes the database objects created by Oracle SQL Connector for HDFS. These objects include external tables, location files, and views. If you delete objects manually, problems can arise as described in "Dropping Dangling Objects".
The -drop
command only deletes objects created by Oracle SQL Connector for HDFS. Oracle recommends that you always use the -drop
command to drop objects created by Oracle SQL Connector for HDFS.
These properties are required to use this command:
oracle.hadoop.exttab.tableName
. For partitioned Hive tables, this is the name of the metadata table. For other data source types, this is the name of the external table.
The JDBC connection properties; see "Connections using url, user, and password Properties."
hadoop jar OSCH_HOME/jlib/orahdfs.jar \ oracle.hadoop.exttab.ExternalTable \ [-conf config_file]... \ [-D property=value]... \ -drop
See Also:
This example shows the command syntax to drop the database objects associated with SALES_DP_XTAB
.
Example 2-11 Dropping Database Objects
$ export HADOOP_CLASSPATH="OSCH_HOME/jlib/*" $ hadoop jar OSCH_HOME/jlib/orahdfs.jar oracle.hadoop.exttab.ExternalTable \ -D oracle.hadoop.exttab.tableName=SALES_DP_XTAB \ -D oracle.hadoop.connection.url=jdbc:oracle:thin:@//myhost:1521/myservicename \ -D oracle.hadoop.connection.user=scott -drop
Because external tables are used to access data, all of the features and limitations of external tables apply. Queries are executed in parallel with automatic load balancing. However, update, insert, and delete operations are not allowed and indexes cannot be created on external tables. When an external table is accessed, a full table scan is always performed.
Oracle SQL Connector for HDFS uses the ORACLE_LOADER
access driver. The hdfs_stream
preprocessor script (provided with Oracle SQL Connector for HDFS) modifies the input data to a format that ORACLE_LOADER
can process.
Oracle SQL Connector for HDFS uses default data type mappings to create columns in an Oracle external table with the appropriate data types for the Hive and text sources. You can override these defaults by setting various configuration properties, for either all columns or a specific column.
For example, a field in a text file might contain a timestamp. By default, the field is mapped to a VARCHAR2
column. However, you can specify a TIMESTAMP
column and provide a datetime mask to cast the values correctly into the TIMESTAMP
data type. The TIMESTAMP
data type supports time-based queries and analysis that are unavailable when the data is presented as text.
Text sources are mapped to VARCHAR2
columns, and Hive columns are mapped to columns with the closest equivalent Oracle data type. Table 2-1 shows the default mappings.
The following properties apply to all columns in the external table. For Hive sources, these property settings override the oracle.hadoop.exttab.hive.*
property settings.
The following properties apply to only one column, whose name is the column_name part of the property name. These property settings override all other settings.
The following properties create an external table in which all columns are the default VARCHAR2
data type:
oracle.hadoop.exttab.tableName=MOVIE_FACT_EXT_TAB_TXT oracle.hadoop.exttab.columnNames=CUST_ID,MOVIE_ID,GENRE_ID,TIME_ID,RECOMMENDED,ACTIVITY_ID,RATING,SALES
In this example, the following properties are set to override the data type of several columns:
oracle.hadoop.exttab.colMap.TIME_ID.columnType=TIMESTAMP oracle.hadoop.exttab.colMap.RECOMMENDED.columnType=NUMBER oracle.hadoop.exttab.colMap.ACTIVITY_ID.columnType=NUMBER oracle.hadoop.exttab.colMap.RATING.columnType=NUMBER oracle.hadoop.exttab.colMap.SALES.columnType=NUMBER
Oracle SQL Connector for HDFS creates an external table with the specified data types:
SQL> DESCRIBE movie_facts_ext
Name Null? Type
----------------------------------------- -------- ----------------------------
CUST_ID VARCHAR2(4000)
MOVIE_ID VARCHAR2(4000)
GENRE_ID VARCHAR2(4000)
TIME_ID TIMESTAMP(9)
RECOMMENDED NUMBER
ACTIVITY_ID NUMBER
RATINGS NUMBER
SALES NUMBER
The next example adds the following property settings to change the length of the VARCHAR2
columns:
oracle.hadoop.exttab.colMap.CUST_ID.columnLength=12 oracle.hadoop.exttab.colMap.MOVIE_ID.columnLength=12 oracle.hadoop.exttab.colMap.GENRE_ID.columnLength=12
All columns now have custom data types:
SQL> DESCRIBE movie_facts_ext
Name Null? Type
----------------------------------------- -------- ----------------------------
CUST_ID VARCHAR2(12)
MOVIE_ID VARCHAR2(12)
GENRE_ID VARCHAR2(12)
TIME_ID TIMESTAMP(9)
RECOMMENDED NUMBER
ACTIVITY_ID NUMBER
RATINGS NUMBER
SALES NUMBER
A location file is a file specified in the location clause of the external table. Oracle SQL Connector for HDFS creates location files that contain only the Universal Resource Identifiers (URIs) of the data files. A data file contains the data stored in HDFS.
To enable parallel processing with external tables, you must specify multiple files in the location clause of the external table. The number of files determines the number of child processes started by the external table during a table read, which is known as the degree of parallelism or DOP.
Ideally, you can decide to run at a particular degree of parallelism and create a number of location files that are a multiple of the degree of parallelism, as described in the following procedure.
To set up parallel processing for maximum performance:
Identify the maximum DOP that your Oracle DBA will permit you to use when running Oracle SQL Connector for HDFS.
When loading a huge amount of data into an Oracle database, you should also work with the DBA to identify a time when the maximum resources are available.
Create a number of location files that is a small multiple of the DOP. For example, if the DOP is 8, then you might create 8, 16, 24, or 32 location files.
Create a number of HDFS files that are about the same size and a multiple of the number of location files. For example, if you have 32 location files, then you might create 128, 1280, or more HDFS files, depending on the amount of data and the minimum HDFS file size.
Set the DOP for the data load, using either the ALTER SESSION
command or hints in the SQL SELECT
statement.
This example sets the DOP to 8 using ALTER SESSION
:
ALTER SESSION FORCE PARALLEL DML PARALLEL 8; ALTER SESSION FORCE PARALLEL QUERY PARALLEL 8;
The next example sets the DOP to 8 using the PARALLEL
hint:
INSERT /*+ parallel(my_db_table,8) */ INTO my_db_table \ SELECT /*+ parallel(my_hdfs_external_table,8) */ * \ FROM my_hdfs_external_table;
An APPEND
hint in the SQL INSERT
statement can also help improve performance.
The Oracle SQL Connector for HDFS command-line tool, ExternalTable
, creates an external table and publishes the HDFS URI information to location files. The external table location files are stored in the directory specified by the oracle.hadoop.exttab.defaultDirectory
property. For an Oracle RAC database, this directory must reside on a distributed file system that is accessible to each database server.
ExternalTable
manages the location files of the external table, which involves the following operations:
Generating new location files in the database directory after checking for name conflicts
Deleting existing location files in the database directory as necessary
Publishing data URIs to new location files
Altering the LOCATION
clause of the external table to match the new location files
Location file management for the supported data sources is described in the following topics.
The ORACLE_LOADER
access driver is required to access Data Pump files. The driver requires that each location file corresponds to a single Data Pump file in HDFS. Empty location files are not allowed, and so the number of location files in the external table must exactly match the number of data files in HDFS.
Oracle SQL Connector for HDFS automatically takes over location file management and ensures that the number of location files in the external table equals the number of Data Pump files in HDFS.
The ORACLE_LOADER
access driver has no limitation on the number of location files. Each location file can correspond to one or more data files in HDFS. The number of location files for the external table is suggested by the oracle.hadoop.exttab.locationFileCount
configuration property.
This is the format of a location file name:
osch-
timestamp-number-n
In this syntax:
timestamp
has the format yyyyMMddhhmmss
, for example, 20121017103941 for October 17, 2012, at 10:39:41.
number
is a random number used to prevent location file name conflicts among different tables.
n
is an index used to prevent name conflicts between location files for the same table.
For example, osch-20121017103941-6807-1.
You can pass configuration properties to the ExternalTable
tool on the command line with the -D
option, or you can create a configuration file and pass it on the command line with the -conf
option. These options must precede the command to be executed.
For example, this command uses a configuration file named example.xml
:
hadoop jar OSCH_HOME/jlib/orahdfs.jar \
oracle.hadoop.exttab.ExternalTable \
-conf /home/oracle/example.xml \
-createTable
A configuration file is an XML document with a very simple structure as follows:
<?xml version="1.0"?> <configuration> <property> <name>property</name> <value>value</value> </property> . . . </configuration>
The following example shows a configuration file. See "Oracle SQL Connector for HDFS Configuration Property Reference" for descriptions of these properties.
Example 2-12 Configuration File for Oracle SQL Connector for HDFS
<?xml version="1.0"?> <configuration> <property> <name>oracle.hadoop.exttab.tableName</name> <value>SH.SALES_EXT_DIR</value> </property> <property> <name>oracle.hadoop.exttab.dataPaths</name> <value>/data/s1/*.csv,/data/s2/*.csv</value> </property> <property> <name>oracle.hadoop.connection.url</name> <value>jdbc:oracle:thin:@//myhost:1521/myservicename</value> </property> <property> <name>oracle.hadoop.connection.user</name> <value>SH</value> </property> </configuration>
The following is a complete list of the configuration properties used by the ExternalTable
command-line tool. The properties are organized into these categories:
General Properties
Property | Description | ||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
oracle.hadoop.exttab.badFileFormatUsePercentA |
Indicates whether an external table bad file name contains '%a'. This is an optional property for the Valid values: Default value: The default bad file name is If See Also:
|
||||||||||||||||||||||
oracle.hadoop.exttab.colMap.columnLength |
Specifies the length of all external table columns of type Default Value: The maximum length allowed by the column type For Oracle Database 12c, Oracle SQL Connector for HDFS sets the length of Valid values: Integer |
||||||||||||||||||||||
oracle.hadoop.exttab.colMap.columnType |
Specifies the data type mapping of all columns for Hive and text sources. Optional. You can override this setting for specific columns by setting Default value: Valid values: The following Oracle data types are supported:
|
||||||||||||||||||||||
oracle.hadoop.exttab.colMap.dateMask |
Specifies the format mask used in the date_format_spec clause of the external table for all Default value: The default globalization format mask, which is set by the Valid values: A datetime format model as described in Oracle Database SQL Language Reference. However, it cannot contain quotation marks. |
||||||||||||||||||||||
oracle.hadoop.exttab.colMap.fieldLength |
Sets the character buffer length used by the Default value: 4000 bytes Valid values: Integer |
||||||||||||||||||||||
oracle.hadoop.exttab.colMap.timestampMask |
Specifies the format mask used in the date_format_spec clause of the external table for all Default value: The default globalization format mask, which is set by the Valid values: A datetime format model as described in Oracle Database SQL Language Reference. However, it cannot contain quotation marks. |
||||||||||||||||||||||
oracle.hadoop.exttab.colMap.timestampTZMask |
Specifies the format mask used in the date_format_spec clause of the external table for all Default value: The default globalization format mask, which is set by the Valid values: A datetime format model as described in the Oracle Database SQL Language Reference. However, it cannot contain quotation marks. |
||||||||||||||||||||||
oracle.hadoop.exttab.colMap.column_name.columnLength |
Specifies the length of all external table columns of type Default Value: The value of Valid values: Integer |
||||||||||||||||||||||
oracle.hadoop.exttab.colMap.column_name.columnType |
Overrides the data type mapping for column_name. Optional. The column_name is case-sensitive. It must exactly match the name of a column in a Hive table or a column listed in Default value: The value of Valid values: See |
||||||||||||||||||||||
oracle.hadoop.exttab.colMap.column_name.dateMask |
Overrides the format mask for column_name. Optional. The column_name is case-sensitive. It must exactly match the name of a column in a Hive table or a column listed in Default value: The value of Valid values: A datetime format model as described in the Oracle Database SQL Language Reference. However, it cannot contain quotation marks. |
||||||||||||||||||||||
oracle.hadoop.exttab.colMap.column_name.fieldLength |
Overrides the character buffer length used by the The column_name is case-sensitive. It must exactly match the name of a column in a Hive table or a column listed in Default value: Oracle SQL Connector for HDFS sets the default field lengths as shown in the following table. Table 2-4 Field Length Calculations
Valid values: Integer |
||||||||||||||||||||||
oracle.hadoop.exttab.colMap.column_name.nullIfSpecifier |
This property is applied to a column identified by column_name in an external table. Optional. Overrides the property Type: string Valid values: same as for the property Default values: none. This property applies only to Delimited Text sources. |
||||||||||||||||||||||
oracle.hadoop.exttab.colMap.column_name.timestampMask |
Overrides the format mask for column_name. Optional. The column_name is case-sensitive. It must exactly match the name of a column in a Hive table or a column listed in Default value: The value of Valid values: A datetime format model as described in the Oracle Database SQL Language Reference. However, it cannot contain quotation marks. |
||||||||||||||||||||||
oracle.hadoop.exttab.colMap.column_name.timestampTZMask |
Overrides the format mask for column_name. Optional. The column_name is case-sensitive. It must exactly match the name of a column in a Hive table or a column listed in Default value: The value of Valid values: A datetime format model as described in Oracle Database SQL Language Reference. However, it cannot contain quotation marks. |
||||||||||||||||||||||
oracle.hadoop.exttab.columnCount |
Specifies the number of columns for the external table created from delimited text files. The column names are set to C1, C2,... Cn, where n is value of this property. This property is ignored if The You must set either this property or |
||||||||||||||||||||||
oracle.hadoop.exttab.columnNames |
Specifies a comma-separated list of column names for an external table created from delimited text files. If this property is not set, then the column names are set to C1, C2,... Cn, where n is the value of the The column names are read as SQL identifiers: unquoted values are capitalized, and double-quoted values stay exactly as entered. The You must set either this property or |
||||||||||||||||||||||
oracle.hadoop.exttab.dataCompressionCodec |
Notice: This property is deprecated.OSCH discovers the compression code of each file at runtime. The dataset can contain both compressed and uncompressed files and can also contain files compressed with different codecs. Specifies the name of the compression codec class used to decompress the data files. Specify this property when the data files are compressed. Optional. This property specifies the class name of any compression codec that implements the Several standard codecs are available in Hadoop, including the following:
Default value: None |
||||||||||||||||||||||
oracle.hadoop.exttab.dataPaths |
Specifies a comma-separated list of fully qualified HDFS paths. This property enables you to restrict the input by using special pattern-matching characters in the path specification. See the following table. This property is required for the For example, to select all files in /data/s2/,/data/s[7-9]/*.csv The external table accesses the data contained in all listed files and all files in listed directories. These files compose a single data set. The data set can contain compressed files or uncompressed files, but not both. Table 2-5 Pattern-Matching Characters
|
||||||||||||||||||||||
oracle.hadoop.exttab.dataPathFilter |
Specifies the path filter class. This property is ignored for Hive data sources. Oracle SQL Connector for HDFS uses a default filter to exclude hidden files, which begin with a dot or an underscore. If you specify another path filter class using the this property, then your filter acts in addition to the default filter. Thus, only visible files accepted by your filter are considered. |
||||||||||||||||||||||
oracle.hadoop.exttab.defaultDirectory |
Specifies the default directory for the Oracle external table. This directory is used for all input and output files that do not explicitly name a directory object. In Oracle RAC, this directory must be on a shared directory accessible by all Oracle instances. Valid value: The name of an existing database directory Unquoted names are changed to upper case. Double-quoted names are not changed; use them when case-sensitivity is desired. Single-quoted names are not allowed for default directory names. The |
||||||||||||||||||||||
oracle.hadoop.exttab.fieldTerminator |
Specifies the field terminator for an external table when Default value: , (comma) Valid values: A string in one of the following formats:
Do not mix the two formats. |
||||||||||||||||||||||
oracle.hadoop.exttab.hive.columnType.* |
Maps a Hive data type to an Oracle data type. The property name identifies the Hive data type, and its value is an Oracle data type. The target columns in the external table are created with the Oracle data type indicated by this property. When Hive Default values: The following table lists the Hive column type properties and their default values. Valid values: See the valid values for Table 2-6 Hive Column Type Mapping Properties
|
||||||||||||||||||||||
oracle.hadoop.exttab.hive.databaseName |
Specifies the name of a Hive database that contains the input data table. The |
||||||||||||||||||||||
oracle.hadoop.exttab.hive.partitionFilter |
Specifies a valid HiveQL expression that is used to filter the source Hive table partitions. This property is ignored if the table is not partitioned. Type: String Default value: None. All partitions of the Hive table are mapped to external tables. Valid values: A valid HiveQL expression. Description: Specifies a valid HiveQL expression that is used to filter the source Hive table partitions. This property is ignored if the Hive table is not partitioned. Including other columns does not raise an error, but unintended consequences can result. Oracle recommends that you exclude other columns. The expression must conform to the following restrictions:
CREATE TABLE t(c string) PARTITIONED BY (p1 string, p2 int, p3 boolean, p4 string, p5 timestamp); Example 1: Nested Expressions p1 like 'abc%' or (p5 >= '2010-06-20' and p5 <= '2010-07-03') Example 2: Built-in Functions year(p5) = 2014 Example 3: Bad Usage: Columns That Are Not Used to Partition the Table These examples show that using c, a column that is not used to partition the table, is unnecessary and can cause unexpected results. This example is equivalent to p2 > 35 and c like 'abc%' This example loads all partitions. All partitions could contain p2 > 35 or c like 'abc%' |
||||||||||||||||||||||
oracle.hadoop.exttab.hive.refreshTables |
Only applies when the source is a Hive partitioned table, is ignored otherwise. This property specifies whether the Set this property to Default value: |
||||||||||||||||||||||
oracle.hadoop.exttab.hive.tableName |
Specifies the name of an existing Hive table. The |
||||||||||||||||||||||
oracle.hadoop.exttab.hive.deleteObsoleteTables |
Specifies whether the Set this property to Default value: |
||||||||||||||||||||||
oracle.hadoop.exttab.initialFieldEncloser |
Specifies the initial field encloser for an external table created from delimited text files. Optional. Default value: null; no enclosers are specified for the external table definition. The Valid values: A string in one of the following formats:
Do not mix the two formats. |
||||||||||||||||||||||
oracle.hadoop.exttab.locationFileCount |
Specifies the desired number of location files for the external table. Applicable only to non-Data-Pump files. Default value: 4 This property is ignored if the data files are in Data Pump format. Otherwise, the number of location files is the lesser of:
At least one location file is created. See "Enabling Parallel Processing" for more information about the number of location files. |
||||||||||||||||||||||
oracle.hadoop.exttab.logDirectory |
Specifies a database directory where log files and bad files are stored. The file names are the default values used by external tables. For example, the name of a log file is the table name followed by This is an optional property for the These are the default file name extensions:
Valid values: An existing Oracle directory object name. Unquoted names are changed to uppercase. Quoted names are not changed. The following table provides examples of how values are transformed. Table 2-7 Examples of Quoted and Unquoted Values
|
||||||||||||||||||||||
oracle.hadoop.exttab.nullIfSpecifier |
Specifies the NULLIF clause of the external table definition. Optional. This property is applied to all the columns in an external table. Type: string Valid values: a string in following formats:
Default values: none. This property applies only to Delimited Text sources. See Also: Example 2-1. This example shows how to usenullIfSpecifier when accessing HDFS data files From Oracle Database. |
||||||||||||||||||||||
oracle.hadoop.exttab.preprocessorDirectory |
Specifies the database directory for the preprocessor. The file-system directory must contain the Default value: The preprocessor directory is used in the |
||||||||||||||||||||||
oracle.hadoop.exttab.preprocessorScript |
Specifies the name of the preprocessor script for the external table. Default value: The preprocessor script name is used in the |
||||||||||||||||||||||
oracle.hadoop.exttab.recordDelimiter |
Specifies the record delimiter for an external table created from delimited text files. Optional. Default value: The Valid values: A string in one of the following formats:
Do not mix the two formats. |
||||||||||||||||||||||
oracle.hadoop.exttab.sourceType |
Specifies the type of source files. The Default value: Valid values: |
||||||||||||||||||||||
oracle.hadoop.exttab.stringSizes |
Indicates whether the lengths specified for character strings are bytes or characters. This value is used in the Default value: Valid values: |
||||||||||||||||||||||
oracle.hadoop.exttab.createLogFiles |
Specifies whether the log files should be created when the external tables are queried. Oracle recommends enabling log file creation during development and disabling log file creation during production for best performance. Default value: Log files are created by default. To stop creating log files you must drop the table, set this property to |
||||||||||||||||||||||
oracle.hadoop.exttab.printVerbose |
Specifies whether to print verbose reports on the console during a Default value: |
||||||||||||||||||||||
oracle.hadoop.exttab.createBadFiles |
Specifies whether bad files should be created when the external tables are queried. Bad files contain information on rows with bad data. Bad files are created only when there is bad data. Oracle recommends creating bad files. Default value: Bad files are created by default. To stop creating bad files you must drop the table, set this property to This property applies only to Hive and Delimited Text sources. |
||||||||||||||||||||||
oracle.hadoop.exttab.logFileFormatUsePercentA |
Indicates whether an external table log file name contains '%a'. This is an optional property for the Valid values: Default value: The default log file name is Diagnostic external tables can be created over the See Also:
|
||||||||||||||||||||||
oracle.hadoop.exttab.tableName |
Specifies the metadata table for partitioned Hive tables or schema-qualified name of the external table for all other data sources, in this format: schemaName.tableName If you omit schemaName, then the schema name defaults to the connection user name. Default value: none Required property for all operations. |
||||||||||||||||||||||
oracle.hadoop.exttab.trailingFieldEncloser |
Specifies the trailing field encloser for an external table created from delimited text files. Optional. Default value: null; defaults to the value of The Valid values: A string in one of the following formats:
Do not mix the two formats. |
Connections using url, user, and password Properties
The url, user, and password properties provide a distinct connection method. Do not mix these properties with those required for a connection using an Oracle Wallet.
Property | Description |
---|---|
oracle.hadoop.connection.url |
Specifies the database connection string in the thin-style service name format: jdbc:oracle:thin:@//host_name:port/service_name If you are unsure of the service name, then enter this SQL command as a privileged user: SQL> show parameter service This property takes precedence over all other connection properties. Default value: Not defined Valid values: A string |
oracle.hadoop.connection.user |
Specifies an Oracle database log-in name. The Default value: Not defined Valid values: A string |
oracle.hadoop.connection.password |
Password for the Oracle Database user. Oracle recommends that you do not use this property to store a clear text password outside of non-sensitive test or demo environments. You can force a password prompt/response by excluding the password property from the connection. In that case, the externalTable tool prompts for the password. If you require a connection with no prompt/response, use the Oracle Wallet connection method described in the next section instead. Default value: Not defined. Valid values: A string |
Connections Using Oracle Wallet
When using Oracle Wallet as an external password store, set the properties shown in the following table.
oracle.hadoop.connection.tnsEntryName |
Specifies a TNS entry name defined in the tnsnames.ora file. This property is used with the Default value: Not defined Valid values: A string |
oracle.hadoop.connection.tns_admin |
Specifies the directory that contains the tnsnames.ora file. Define this property to use transparent network substrate (TNS) entry names in database connection strings. When using TNSNames with the JDBC thin driver, you must set either this property or the Java This property must be set when using Oracle Wallet as an external password store. Default value: The value of the Java Valid values: A string |
oracle.hadoop.connection.wallet_location |
Specifies a file path to an Oracle Wallet directory where the connection credential is stored. Default value: Not defined Valid values: A string |
Tip:
Connections using Oracle Wallet can accommodate many TNS entries and are therefore recommended over those using the user, password and url properties which are restricted to a single machine/port/servicename combination.For a simple step-by-step demonstration, see the posting Using Oracle SQL Connector for HDFS with Oracle Wallet in the Connecting Hadoop With Oracle blog.
Parallel processing is extremely important when you are working with large volumes of data. When you use external tables, always enable parallel query with this SQL command:
ALTER SESSION ENABLE PARALLEL QUERY;
Before loading the data into an Oracle database from the external files created by Oracle SQL Connector for HDFS, enable parallel DDL:
ALTER SESSION ENABLE PARALLEL DDL;
Before inserting data into an existing database table, enable parallel DML with this SQL command:
ALTER SESSION ENABLE PARALLEL DML;
Hints such as APPEND
and PQ_DISTRIBUTE
also improve performance when you are inserting data.
SeeMy Oracle Support Document 2111850.1 for additional details and examples for improving performance.