This chapter describes how to use Oracle SQL Connector for Hadoop Distributed File System (HDFS) to facilitate data access between Hadoop and Oracle Database.
This chapter contains the following sections:
Using Oracle SQL Connector for HDFS, you can use Oracle Database to access and analyze data residing in Hadoop in these formats:
Data Pump files in HDFS
Delimited text files in HDFS
Hive tables
For other file formats, such as JSON files, you can stage the input in Hive tables before using Oracle SQL Connector for HDFS.
Oracle SQL Connector for HDFS uses external tables to provide Oracle Database with read access to Hive tables, and to delimited text files and Data Pump files in HDFS. An external table is an Oracle Database object that identifies the location of data outside of a database. Oracle Database accesses the data by using the metadata provided when the external table was created. By querying the external tables, you can access data stored in HDFS and Hive tables as if that data were stored in tables in an Oracle database.
To create an external table for this purpose, you use the ExternalTable command-line tool provided with Oracle SQL Connector for HDFS. You provide ExternalTable with information about the data source in Hadoop and about your schema in an Oracle Database. You provide this information either as parameters to the ExternalTable command or in an XML file.
When the external table is ready, you can query the data the same as any other database table. You can query and join data in HDFS or a Hive table with other database-resident data.
You can also use SQL to perform bulk loads of data from the external table into Oracle database tables. If the data is queried routinely, you may prefer to store it, in whole or in part, in an Oracle database.
The following list identifies the basic steps that you take when using Oracle SQL Connector for HDFS.
The first time you use Oracle SQL Connector for HDFS, ensure that the software is installed and configured.
See "Configuring Your System for Oracle SQL Connector for HDFS."
Log in to the appropriate system, either the Oracle Database system or a node in the Hadoop cluster.
See "Configuring Your System for Oracle SQL Connector for HDFS."
Create an XML document describing the connections and the data source, unless you are providing these parameters in the ExternalTable command.

Create a shell script containing an ExternalTable command.
Run the shell script.
If the job fails, then use the diagnostic messages in the output to identify and correct the error. Depending on how far the job progressed before failing, you may need to delete the table definition from the Oracle database before rerunning the script.
After the job succeeds, connect to Oracle Database as the owner of the external table. Query the table to ensure that the data is accessible.
If the data will be queried frequently, then you may want to load it into a database table. External tables do not have indexes or partitions.
Example 2-1 illustrates these steps.
Example 2-1 Accessing HDFS Data Files from Oracle Database
$ cat moviefact_hdfs.sh
# Add environment variables
export OSCH_HOME="/opt/oracle/orahdfs-2.2.0"

hadoop jar $OSCH_HOME/jlib/orahdfs.jar \
       oracle.hadoop.exttab.ExternalTable \
       -conf /home/jdoe/movie/moviefact_hdfs.xml \
       -createTable

$ cat moviefact_hdfs.xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>oracle.hadoop.exttab.tableName</name>
    <value>MOVIE_FACT_EXT_TAB_TXT</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.locationFileCount</name>
    <value>4</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.dataPaths</name>
    <value>/user/jdoe/moviework/data/part*</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.fieldTerminator</name>
    <value>\u0009</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.defaultDirectory</name>
    <value>MOVIE_DIR</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.columnNames</name>
    <value>CUST_ID,MOVIE_ID,GENRE_ID,TIME_ID,RECOMMENDED,ACTIVITY_ID,RATING,SALES</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.sourceType</name>
    <value>text</value>
  </property>
  <property>
    <name>oracle.hadoop.connection.url</name>
    <value>jdbc:oracle:thin:@//dbhost:1521/orcl.example.com</value>
  </property>
  <property>
    <name>oracle.hadoop.connection.user</name>
    <value>MOVIEDEMO</value>
  </property>
</configuration>

$ sh moviefact_hdfs.sh
Oracle SQL Connector for HDFS Release 2.2.0 - Production
Copyright (c) 2012, Oracle and/or its affiliates. All rights reserved.
Enter Database Password: password
The create table command succeeded.
CREATE TABLE "MOVIEDEMO"."MOVIE_FACT_EXT_TAB_TXT"
(
 "CUST_ID"                        VARCHAR2(4000),
 "MOVIE_ID"                       VARCHAR2(4000),
 "GENRE_ID"                       VARCHAR2(4000),
 "TIME_ID"                        VARCHAR2(4000),
 "RECOMMENDED"                    VARCHAR2(4000),
 "ACTIVITY_ID"                    VARCHAR2(4000),
 "RATING"                         VARCHAR2(4000),
 "SALES"                          VARCHAR2(4000)
)
ORGANIZATION EXTERNAL
(
   TYPE ORACLE_LOADER
   DEFAULT DIRECTORY "MOVIE_DIR"
   ACCESS PARAMETERS
   (
     RECORDS DELIMITED BY 0X'0A'
     disable_directory_link_check
     CHARACTERSET AL32UTF8
     STRING SIZES ARE IN CHARACTERS
     PREPROCESSOR "OSCH_BIN_PATH":'hdfs_stream'
     FIELDS TERMINATED BY 0X'09'
     MISSING FIELD VALUES ARE NULL
     (
       "CUST_ID" CHAR,
       "MOVIE_ID" CHAR,
       "GENRE_ID" CHAR,
       "TIME_ID" CHAR,
       "RECOMMENDED" CHAR,
       "ACTIVITY_ID" CHAR,
       "RATING" CHAR,
       "SALES" CHAR
     )
   )
   LOCATION
   (
     'osch-20130314092801-1513-1',
     'osch-20130314092801-1513-2',
     'osch-20130314092801-1513-3',
     'osch-20130314092801-1513-4'
   )
) PARALLEL REJECT LIMIT UNLIMITED;

The following location files were created.

osch-20130314092801-1513-1 contains 1 URI, 12754882 bytes

    12754882 hdfs://bda-ns/user/jdoe/moviework/data/part-00001

osch-20130314092801-1513-2 contains 1 URI, 438 bytes

         438 hdfs://bda-ns/user/jdoe/moviework/data/part-00002

osch-20130314092801-1513-3 contains 1 URI, 432 bytes

         432 hdfs://bda-ns/user/jdoe/moviework/data/part-00003

osch-20130314092801-1513-4 contains 1 URI, 202 bytes

         202 hdfs://bda-ns/user/jdoe/moviework/data/part-00004

SQL*Plus: Release 11.2.0.3.0 Production on Thu Mar 14 14:14:31 2013

Copyright (c) 1982, 2011, Oracle. All rights reserved.

Enter password: password

Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, OLAP and Data Mining options

SQL> SELECT cust_id, movie_id, time_id FROM movie_fact_ext_tab_txt
  2  WHERE rownum < 5;

CUST_ID              MOVIE_ID             TIME_ID
-------------------- -------------------- --------------------
1150211              585                  01-JAN-11
1221463              9870                 01-JAN-11
1002672              1422                 01-JAN-11
1095718              544                  01-JAN-11

SQL> DESCRIBE movie_fact_ext_tab_txt
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 CUST_ID                                            VARCHAR2(4000)
 MOVIE_ID                                           VARCHAR2(4000)
 GENRE_ID                                           VARCHAR2(4000)
 TIME_ID                                            VARCHAR2(4000)
 RECOMMENDED                                        VARCHAR2(4000)
 ACTIVITY_ID                                        VARCHAR2(4000)
 RATING                                             VARCHAR2(4000)
 SALES                                              VARCHAR2(4000)

SQL> CREATE TABLE movie_facts AS
  2  SELECT CAST(cust_id AS VARCHAR2(12)) cust_id,
  3     CAST(movie_id AS VARCHAR2(12)) movie_id,
  4     CAST(genre_id AS VARCHAR2(3)) genre_id,
  5     TO_TIMESTAMP(time_id,'YYYY-MM-DD-HH24:MI:SS:FF') time,
  6     TO_NUMBER(recommended) recommended,
  7     TO_NUMBER(activity_id) activity_id,
  8     TO_NUMBER(rating) rating,
  9     TO_NUMBER(sales) sales
 10  FROM movie_fact_ext_tab_txt;

SQL> DESCRIBE movie_facts
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 CUST_ID                                            VARCHAR2(12)
 MOVIE_ID                                           VARCHAR2(12)
 GENRE_ID                                           VARCHAR2(3)
 TIME                                               TIMESTAMP(9)
 RECOMMENDED                                        NUMBER
 ACTIVITY_ID                                        NUMBER
 RATING                                             NUMBER
 SALES                                              NUMBER
You can run Oracle SQL Connector for HDFS on either the Oracle Database system or the Hadoop cluster:
For Hive sources, you must log in to a node in the Hadoop cluster.
For text and Data Pump format files, you can log in to either the Oracle Database system or a node in the Hadoop cluster.
Oracle SQL Connector for HDFS requires additions to the HADOOP_CLASSPATH environment variable on the system where you log in. Your system administrator may have set them up for you when creating your account, or may have left that task for you. See "Setting Up User Accounts on the Oracle Database System" and "Setting Up User Accounts on the Hadoop Cluster".
Setting up the environment variables:
Verify that HADOOP_CLASSPATH includes the path to the JAR files for Oracle SQL Connector for HDFS:
path/orahdfs-2.2.0/jlib/*
If you are logged in to a Hadoop cluster with Hive data sources, then verify that HADOOP_CLASSPATH also includes the Hive JAR files and conf directory. For example:
/usr/lib/hive/lib/* /etc/hive/conf
For your convenience, you can create an OSCH_HOME environment variable. The following is the Bash command for setting it on Oracle Big Data Appliance:
$ export OSCH_HOME="/opt/oracle/orahdfs-2.2.0"
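Putting these settings together, a Bash profile on a Hadoop cluster node with Hive data sources might look like the following sketch. The paths match the examples in this chapter; substitute the paths for your own installation:

```shell
# Illustrative environment setup only -- adjust paths for your system.
export OSCH_HOME="/opt/oracle/orahdfs-2.2.0"

# OSCH JAR files first; the Hive JARs and conf directory are needed
# only when you work with Hive data sources.
export HADOOP_CLASSPATH="$OSCH_HOME/jlib/*:/usr/lib/hive/lib/*:/etc/hive/conf:$HADOOP_CLASSPATH"

echo "$HADOOP_CLASSPATH"
```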
See Also:
"Oracle SQL Connector for Hadoop Distributed File System Setup" for instructions for installing the software and setting up user accounts on both systems.

OSCH_HOME/doc/README.txt for information about known problems with Oracle SQL Connector for HDFS.
Oracle SQL Connector for HDFS provides a command-line tool named ExternalTable. This section describes the basic use of this tool. See "Creating External Tables" for the command syntax that is specific to your data source format.
The ExternalTable tool uses the values of several properties to do the following tasks:
Create an external table
Populate the location files
Publish location files to an existing external table
List the location files
Describe an external table
You can specify these property values in an XML document or individually on the command line. See "Configuring Oracle SQL Connector for HDFS".
This is the full syntax of the ExternalTable command-line tool:
hadoop jar OSCH_HOME/jlib/orahdfs.jar \
oracle.hadoop.exttab.ExternalTable \
[-conf config_file]... \
[-D property=value]... \
-createTable [--noexecute]
   | -publish [--noexecute]
   | -listLocations [--details]
   | -getDDL
You can either create the OSCH_HOME environment variable or replace OSCH_HOME in the command syntax with the full path to the installation directory for Oracle SQL Connector for HDFS. On Oracle Big Data Appliance, this directory is:
/opt/oracle/orahdfs-version
For example, you might run the ExternalTable command-line tool with a command like this:
hadoop jar /opt/oracle/orahdfs-2.2.0/jlib/orahdfs.jar \
oracle.hadoop.exttab.ExternalTable \
. . .
Parameter Descriptions
-conf config_file

Identifies the name of an XML configuration file containing properties needed by the command being executed. See "Configuring Oracle SQL Connector for HDFS."
-D property=value

Assigns a value to a specific property.
-createTable [--noexecute]

Creates an external table definition and publishes the data URIs to the location files of the external table. The output report shows the DDL used to create the external table and lists the contents of the location files.
Use the --noexecute option to see the execution plan of the command. The operation is not executed, but the report includes the details of the execution plan and any errors. Oracle recommends that you first execute a -createTable command with --noexecute.
-publish [--noexecute]

Publishes the data URIs to the location files of an existing external table. Use this option after adding new data files, so that the existing external table can access them.
Use the --noexecute option to see the execution plan of the command. The operation is not executed, but the report shows the planned SQL ALTER TABLE command and location files. The report also shows any errors. Oracle recommends that you first execute a -publish command with --noexecute.
-listLocations [--details]

Shows the location file content as text. With the --details option, this command provides a detailed listing. See "What Are Location Files?"
-getDDL

Prints the table definition of an existing external table. See "Describing External Tables."
See Also:
"Syntax Conventions"

You can create external tables automatically using the ExternalTable tool provided in Oracle SQL Connector for HDFS.
To create an external table using the ExternalTable tool, follow the instructions for your data source:
When the ExternalTable -createTable command finishes executing, the external table is ready for use.
To create external tables manually, follow the instructions in "Creating External Tables in SQL."
ExternalTable Syntax for -createTable
Use the following syntax to create an external table and populate its location files:
hadoop jar OSCH_HOME/jlib/orahdfs.jar oracle.hadoop.exttab.ExternalTable \
[-conf config_file]... \
[-D property=value]... \
-createTable [--noexecute]
See Also:
"ExternalTable Command-Line Tool Syntax"

Oracle SQL Connector for HDFS supports only Data Pump files produced by Oracle Loader for Hadoop, and does not support generic Data Pump files produced by Oracle Utilities.
Oracle SQL Connector for HDFS creates the external table definition for Data Pump files by using the metadata from the Data Pump file header. It uses the ORACLE_LOADER access driver with the preprocessor access parameter. It also uses a special access parameter named EXTERNAL VARIABLE DATA, which enables ORACLE_LOADER to read the Data Pump format files generated by Oracle Loader for Hadoop.
Note:
Oracle SQL Connector for HDFS requires a patch to Oracle Database 11.2.0.2 before the connector can access Data Pump files produced by Oracle Loader for Hadoop. To download this patch, go to http://support.oracle.com and search for bug 14557588.
Release 11.2.0.3 and later releases do not require this patch.
These properties are required:
oracle.hadoop.exttab.tableName
oracle.hadoop.exttab.defaultDirectory
oracle.hadoop.exttab.dataPaths
oracle.hadoop.exttab.sourceType=datapump
oracle.hadoop.connection.url
oracle.hadoop.connection.user
See "Configuring Oracle SQL Connector for HDFS" for descriptions of the properties used for this data source.
Example 2-2 is an XML template containing all the properties that can be used to describe a Data Pump file. To use the template, cut and paste it into a text file, enter the appropriate values to describe your Data Pump file, and delete any optional properties that you do not need. For more information about using XML templates, see "Creating a Configuration File."
Example 2-2 XML Template with Properties for a Data Pump Format File
<?xml version="1.0"?>

<!-- Required Properties -->

<configuration>
  <property>
    <name>oracle.hadoop.exttab.tableName</name>
    <value>value</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.defaultDirectory</name>
    <value>value</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.dataPaths</name>
    <value>value</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.sourceType</name>
    <value>datapump</value>
  </property>
  <property>
    <name>oracle.hadoop.connection.url</name>
    <value>value</value>
  </property>
  <property>
    <name>oracle.hadoop.connection.user</name>
    <value>value</value>
  </property>

<!-- Optional Properties -->

  <property>
    <name>oracle.hadoop.exttab.logDirectory</name>
    <value>value</value>
  </property>
</configuration>
Example 2-3 creates an external table named SALES_DP_XTAB to read Data Pump files.
Example 2-3 Defining an External Table for Data Pump Format Files
Log in as the operating system user that Oracle Database runs under (typically the oracle user), and create a file-system directory:
$ mkdir /scratch/sales_dp_dir
Create a database directory and grant read and write access to it:
$ sqlplus / as sysdba
SQL> CREATE OR REPLACE DIRECTORY sales_dp_dir AS '/scratch/sales_dp_dir';
SQL> GRANT READ, WRITE ON DIRECTORY sales_dp_dir TO scott;
Create the external table:
hadoop jar OSCH_HOME/jlib/orahdfs.jar \
oracle.hadoop.exttab.ExternalTable \
-D oracle.hadoop.exttab.tableName=SALES_DP_XTAB \
-D oracle.hadoop.exttab.sourceType=datapump \
-D oracle.hadoop.exttab.dataPaths=hdfs:///user/scott/olh_sales_dpoutput/ \
-D oracle.hadoop.exttab.defaultDirectory=SALES_DP_DIR \
-D oracle.hadoop.connection.url=jdbc:oracle:thin:@//myhost:1521/myservicename \
-D oracle.hadoop.connection.user=SCOTT \
-createTable
Oracle SQL Connector for HDFS creates the external table definition from a Hive table by contacting the Hive metastore client to retrieve information about the table columns and the location of the table data. In addition, the Hive table data paths are published to the location files of the Oracle external table.
To read Hive table metadata, Oracle SQL Connector for HDFS requires that the Hive JAR files are included in the HADOOP_CLASSPATH variable. This means that Oracle SQL Connector for HDFS must be installed and running on a computer with a working Hive client.
Ensure that you add the Hive configuration directory to the HADOOP_CLASSPATH environment variable. You must have a correctly functioning Hive client.
For Hive managed tables, the data paths come from the warehouse directory.
For Hive external tables, the data paths from an external location in HDFS are published to the location files of the Oracle external table. Hive external tables can have no data, because Hive does not check if the external location is defined when the table is created. If the Hive table is empty, then one location file is published with just a header and no data URIs.
The Oracle external table is not a "live" Hive table. When changes are made to a Hive table, you must use the ExternalTable tool to either republish the data or create a new external table.
Oracle SQL Connector for HDFS supports non-partitioned Hive tables that are defined using ROW FORMAT DELIMITED and FILE FORMAT TEXTFILE clauses. Both Hive-managed tables and Hive external tables are supported.
Hive tables can be either bucketed or not bucketed. Table columns with all primitive types from Hive 0.7.1 (CDH3) are supported, as are the DECIMAL and TIMESTAMP types.
These properties are required for Hive table sources:
oracle.hadoop.exttab.tableName
oracle.hadoop.exttab.defaultDirectory
oracle.hadoop.exttab.sourceType=hive
oracle.hadoop.exttab.hive.tableName
oracle.hadoop.exttab.hive.databaseName
oracle.hadoop.connection.url
oracle.hadoop.connection.user
See "Configuring Oracle SQL Connector for HDFS" for descriptions of the properties used for this data source.
This property is optional for Hive table sources:
oracle.hadoop.exttab.locationFileCount
Example 2-4 is an XML template containing all the properties that can be used to describe a Hive table. To use the template, cut and paste it into a text file, enter the appropriate values to describe your Hive table, and delete any optional properties that you do not need. For more information about using XML templates, see "Creating a Configuration File."
Example 2-4 XML Template with Properties for a Hive Table
<?xml version="1.0"?>

<!-- Required Properties -->

<configuration>
  <property>
    <name>oracle.hadoop.exttab.tableName</name>
    <value>value</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.defaultDirectory</name>
    <value>value</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.sourceType</name>
    <value>hive</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.hive.tableName</name>
    <value>value</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.hive.databaseName</name>
    <value>value</value>
  </property>
  <property>
    <name>oracle.hadoop.connection.url</name>
    <value>value</value>
  </property>
  <property>
    <name>oracle.hadoop.connection.user</name>
    <value>value</value>
  </property>

<!-- Optional Properties -->

  <property>
    <name>oracle.hadoop.exttab.locationFileCount</name>
    <value>value</value>
  </property>
</configuration>
Example 2-5 creates an external table named SALES_HIVE_XTAB to read data from a Hive table. The example defines all the properties on the command line instead of in an XML file.
Example 2-5 Defining an External Table for a Hive Table
Log in as the operating system user that Oracle Database runs under (typically the oracle user), and create a file-system directory:
$ mkdir /scratch/sales_hive_dir
Create a database directory and grant read and write access to it:
$ sqlplus / as sysdba
SQL> CREATE OR REPLACE DIRECTORY sales_hive_dir AS '/scratch/sales_hive_dir';
SQL> GRANT READ, WRITE ON DIRECTORY sales_hive_dir TO scott;
Create the external table:
hadoop jar OSCH_HOME/jlib/orahdfs.jar \
oracle.hadoop.exttab.ExternalTable \
-D oracle.hadoop.exttab.tableName=SALES_HIVE_XTAB \
-D oracle.hadoop.exttab.sourceType=hive \
-D oracle.hadoop.exttab.locationFileCount=2 \
-D oracle.hadoop.exttab.hive.tableName=sales_country_us \
-D oracle.hadoop.exttab.hive.databaseName=salesdb \
-D oracle.hadoop.exttab.defaultDirectory=SALES_HIVE_DIR \
-D oracle.hadoop.connection.url=jdbc:oracle:thin:@//myhost:1521/myservicename \
-D oracle.hadoop.connection.user=SCOTT \
-createTable
Oracle SQL Connector for HDFS creates the external table definition for delimited text files using configuration properties that specify the number of columns, the text delimiter, and optionally, the external table column names. All text columns in the external table are VARCHAR2. If column names are not provided, they default to C1 to Cn, where n is the number of columns specified by the oracle.hadoop.exttab.columnCount property.
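As a quick sketch of this default naming rule (an illustration, not connector code), the names for a four-column table can be generated like this:

```shell
# Default column names for columnCount=4 are C1..C4.
# Illustration only; OSCH generates these names internally.
count=4
names=
i=1
while [ "$i" -le "$count" ]; do
  names="${names:+$names,}C$i"   # append with a comma separator
  i=$((i + 1))
done
echo "$names"
```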
These properties are required for delimited text sources:
oracle.hadoop.exttab.tableName
oracle.hadoop.exttab.defaultDirectory
oracle.hadoop.exttab.dataPaths
oracle.hadoop.exttab.columnCount or oracle.hadoop.exttab.columnNames
oracle.hadoop.connection.url
oracle.hadoop.connection.user
See "Configuring Oracle SQL Connector for HDFS" for descriptions of the properties used for this data source.
These properties are optional for delimited text sources:
oracle.hadoop.exttab.recordDelimiter
oracle.hadoop.exttab.fieldTerminator
oracle.hadoop.exttab.initialFieldEncloser
oracle.hadoop.exttab.trailingFieldEncloser
oracle.hadoop.exttab.locationFileCount
Example 2-6 is an XML template containing all the properties that can be used to describe a delimited text file. To use the template, cut and paste it into a text file, enter the appropriate values to describe your data files, and delete any optional properties that you do not need. For more information about using XML templates, see "Creating a Configuration File."
Example 2-6 XML Template with Properties for a Delimited Text File
<?xml version="1.0"?>

<!-- Required Properties -->

<configuration>
  <property>
    <name>oracle.hadoop.exttab.tableName</name>
    <value>value</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.defaultDirectory</name>
    <value>value</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.dataPaths</name>
    <value>value</value>
  </property>

<!-- Use either columnCount or columnNames -->

  <property>
    <name>oracle.hadoop.exttab.columnCount</name>
    <value>value</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.columnNames</name>
    <value>value</value>
  </property>

  <property>
    <name>oracle.hadoop.connection.url</name>
    <value>value</value>
  </property>
  <property>
    <name>oracle.hadoop.connection.user</name>
    <value>value</value>
  </property>

<!-- Optional Properties -->

  <property>
    <name>oracle.hadoop.exttab.recordDelimiter</name>
    <value>value</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.fieldTerminator</name>
    <value>value</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.initialFieldEncloser</name>
    <value>value</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.trailingFieldEncloser</name>
    <value>value</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.locationFileCount</name>
    <value>value</value>
  </property>
</configuration>
Example 2-7 creates an external table named SALES_DT_XTAB from delimited text files.
Example 2-7 Defining an External Table for Delimited Text Files
Log in as the operating system user that Oracle Database runs under (typically the oracle user), and create a file-system directory:
$ mkdir /scratch/sales_dt_dir
Create a database directory and grant read and write access to it:
$ sqlplus / as sysdba
SQL> CREATE OR REPLACE DIRECTORY sales_dt_dir AS '/scratch/sales_dt_dir';
SQL> GRANT READ, WRITE ON DIRECTORY sales_dt_dir TO scott;
Create the external table:
hadoop jar OSCH_HOME/jlib/orahdfs.jar \
oracle.hadoop.exttab.ExternalTable \
-D oracle.hadoop.exttab.tableName=SALES_DT_XTAB \
-D oracle.hadoop.exttab.locationFileCount=2 \
-D oracle.hadoop.exttab.dataPaths="hdfs:///user/scott/olh_sales/*.dat" \
-D oracle.hadoop.exttab.columnCount=10 \
-D oracle.hadoop.exttab.defaultDirectory=SALES_DT_DIR \
-D oracle.hadoop.connection.url=jdbc:oracle:thin:@//myhost:1521/myservicename \
-D oracle.hadoop.connection.user=SCOTT \
-createTable
You can create an external table manually for Oracle SQL Connector for HDFS. For example, the following procedure enables you to use external table syntax that is not exposed by the ExternalTable -createTable command.
Additional syntax might not be supported for Data Pump format files.
To create an external table manually:
Use the -createTable --noexecute command to generate the external table DDL.
Make whatever changes are needed to the DDL.
Run the DDL from Step 2 to create the table definition in the Oracle database.
Use the ExternalTable -publish command to publish the data URIs to the location files of the external table.
The -createTable command creates the metadata in Oracle Database and populates the location files with the Universal Resource Identifiers (URIs) of the data files in HDFS. However, you might publish the URIs as a separate step from creating the external table in cases like these:
You want to publish new data into an already existing external table.
You created the external table manually instead of using the ExternalTable tool.
In both cases, you can use ExternalTable with the -publish command to populate the external table location files with the URIs of the data files in HDFS. See "Location File Management".
ExternalTable Syntax for -publish
hadoop jar OSCH_HOME/jlib/orahdfs.jar \
oracle.hadoop.exttab.ExternalTable \
[-conf config_file]... \
[-D property=value]... \
-publish [--noexecute]
See Also:
"ExternalTable Command-Line Tool Syntax"

ExternalTable Command-Line Tool Example
Example 2-8 sets HADOOP_CLASSPATH and publishes the HDFS data paths to the external table created in Example 2-3. See "Configuring Your System for Oracle SQL Connector for HDFS" for more information about setting this environment variable.
Example 2-8 Publishing HDFS Data Paths to an External Table for Data Pump Format Files
This example uses the Bash shell.
$ export HADOOP_CLASSPATH="OSCH_HOME/jlib/*"
$ hadoop jar OSCH_HOME/jlib/orahdfs.jar oracle.hadoop.exttab.ExternalTable \
-D oracle.hadoop.exttab.tableName=SALES_DP_XTAB \
-D oracle.hadoop.exttab.sourceType=datapump \
-D oracle.hadoop.exttab.dataPaths=hdfs:/user/scott/data/ \
-D oracle.hadoop.connection.url=jdbc:oracle:thin:@//myhost:1521/myservicename \
-D oracle.hadoop.connection.user=scott \
-publish
In this example:
OSCH_HOME is the full path to the Oracle SQL Connector for HDFS installation directory.
SALES_DP_XTAB is the external table created in Example 2-3.
hdfs:/user/scott/data/ is the location of the HDFS data.
jdbc:oracle:thin:@//myhost:1521/myservicename is the database connection string.
The -listLocations command is a debugging and diagnostic utility that enables you to see the location file metadata and contents. You can use this command to verify the integrity of the location files of an Oracle external table.
These properties are required to use this command:
oracle.hadoop.exttab.tableName
The JDBC connection properties; see "Connection Properties."
ExternalTable Syntax for -listLocations
hadoop jar OSCH_HOME/jlib/orahdfs.jar \
oracle.hadoop.exttab.ExternalTable \
[-conf config_file]... \
[-D property=value]... \
-listLocations [--details]
See Also:
"ExternalTable Command-Line Tool Syntax"

The -getDDL command is a debugging and diagnostic utility that prints the definition of an existing external table. This command follows the security model of the PL/SQL DBMS_METADATA package, which enables non-privileged users to see the metadata for their own objects.
These properties are required to use this command:
oracle.hadoop.exttab.tableName
The JDBC connection properties; see "Connection Properties."
ExternalTable Syntax for -getDDL
hadoop jar OSCH_HOME/jlib/orahdfs.jar \
oracle.hadoop.exttab.ExternalTable \
[-conf config_file]... \
[-D property=value]... \
-getDDL
See Also:
"ExternalTable Command-Line Tool Syntax"

Because external tables are used to access data, all of the features and limitations of external tables apply. Queries are executed in parallel with automatic load balancing. However, update, insert, and delete operations are not allowed, and indexes cannot be created on external tables. When an external table is accessed, a full table scan is always performed.
Oracle SQL Connector for HDFS uses the ORACLE_LOADER access driver. The hdfs_stream preprocessor script (provided with Oracle SQL Connector for HDFS) modifies the input data to a format that ORACLE_LOADER can process.
See Also:
Oracle Database Administrator's Guide for information about external tables
Oracle Database Utilities for more information about external tables, performance hints, and restrictions when you are using the ORACLE_LOADER access driver.
A location file is a file specified in the location clause of the external table. Oracle SQL Connector for HDFS creates location files that contain only the Universal Resource Identifiers (URIs) of the data files. A data file contains the data stored in HDFS.
To enable parallel processing with external tables, you must specify multiple files in the location clause of the external table. The number of files, also known as the degree of parallelism, determines the number of child processes started by the external table during a table read. Ideally, the degree of parallelism is no larger than the number of data files, to avoid idle child processes.
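This sizing rule can be sketched as follows (an illustration with hypothetical numbers, not connector code): cap the requested degree of parallelism at the number of data files so that no child process sits idle during a table read.

```shell
# Hypothetical values: 8 requested parallel processes, 4 data files.
datafiles=4
requested_dop=8

# Ideal degree of parallelism is no larger than the number of data files.
if [ "$requested_dop" -lt "$datafiles" ]; then
  dop=$requested_dop
else
  dop=$datafiles
fi
echo "$dop"
```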
The Oracle SQL Connector for HDFS command-line tool, ExternalTable, manages the location files of the external table. Location file management involves the following operations:
Generating new location files in the database directory after checking for name conflicts
Deleting existing location files in the database directory as necessary
Publishing data URIs to new location files
Altering the LOCATION clause of the external table to match the new location files
Location file management for the supported data sources is described in the following topics.
The ORACLE_LOADER access driver is required to access Data Pump files. The driver requires that each location file corresponds to a single Data Pump file in HDFS. Empty location files are not allowed, and so the number of location files in the external table must exactly match the number of data files in HDFS.
Oracle SQL Connector for HDFS automatically takes over location file management and ensures that the number of location files in the external table equals the number of Data Pump files in HDFS.
The ORACLE_LOADER access driver has no limitation on the number of location files. Each location file can correspond to one or more data files in HDFS. The number of location files for the external table is suggested by the oracle.hadoop.exttab.locationFileCount configuration property.
This is the format of a location file name:
osch-timestamp-number-n
In this syntax:
timestamp has the format yyyyMMddhhmmss, for example, 20121017103941 for October 17, 2012, at 10:39:41.
number is a random number used to prevent location file name conflicts among different tables.
n is an index used to prevent name conflicts between location files for the same table.
For example, osch-20121017103941-6807-1.
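The name splits cleanly on its hyphens. For instance, this small POSIX shell sketch (illustrative only) pulls the three parts out of the sample name above:

```shell
# Split a location file name of the form osch-timestamp-number-n
# into its components, using hyphen-delimited fields.
name="osch-20121017103941-6807-1"
timestamp=$(echo "$name" | cut -d- -f2)
number=$(echo "$name" | cut -d- -f3)
index=$(echo "$name" | cut -d- -f4)
echo "timestamp=$timestamp number=$number index=$index"
```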
You can pass configuration properties to the ExternalTable tool on the command line with the -D option, or you can create a configuration file and pass it on the command line with the -conf option. These options must precede the command to be executed (-createTable, -publish, -listLocations, or -getDDL).
For example, this command uses a configuration file named example.xml:
hadoop jar OSCH_HOME/jlib/orahdfs.jar \
oracle.hadoop.exttab.ExternalTable \
-conf /home/oracle/example.xml \
-createTable
See "ExternalTable Command-Line Tool Syntax".
A configuration file is an XML document with a very simple structure as follows:
<?xml version="1.0"?>
<configuration>
  <property>
    <name>property</name>
    <value>value</value>
  </property>
  .
  .
  .
</configuration>
Example 2-9 shows a configuration file. See "Oracle SQL Connector for HDFS Configuration Property Reference" for descriptions of these properties.
Example 2-9 Configuration File for Oracle SQL Connector for HDFS
<?xml version="1.0"?>
<configuration>
  <property>
    <name>oracle.hadoop.exttab.tableName</name>
    <value>SH.SALES_EXT_DIR</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.dataPaths</name>
    <value>/data/s1/*.csv,/data/s2/*.csv</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.dataCompressionCodec</name>
    <value>org.apache.hadoop.io.compress.DefaultCodec</value>
  </property>
  <property>
    <name>oracle.hadoop.connection.url</name>
    <value>jdbc:oracle:thin:@//myhost:1521/myservicename</value>
  </property>
  <property>
    <name>oracle.hadoop.connection.user</name>
    <value>SH</value>
  </property>
</configuration>
The following is a complete list of the configuration properties used by the ExternalTable
command-line tool. The properties are organized into these categories:
The number of columns for the external table created from delimited text files. The column names are set to C1, C2,... Cn, where n is the value of this property.
This property is ignored if oracle.hadoop.exttab.columnNames
is set.
The -createTable
command uses this property when oracle.hadoop.exttab.sourceType=text
.
You must set one of these properties when creating an external table from delimited text files:
oracle.hadoop.exttab.columnNames
oracle.hadoop.exttab.columnCount
A comma-separated list of column names for an external table created from delimited text files. If this property is not set, then the column names are set to C1, C2,... Cn, where n is the value of the oracle.hadoop.exttab.columnCount
property.
The column names are read as SQL identifiers: unquoted names are converted to uppercase, and double-quoted names retain their exact case.
The -createTable
command uses this property when oracle.hadoop.exttab.sourceType=text
.
You must set one of these properties when creating an external table from delimited text files:
oracle.hadoop.exttab.columnNames
oracle.hadoop.exttab.columnCount
The name of the compression codec class used to decompress the data files. Specify this property when the data files are compressed. Optional.
This property specifies the class name of the compression codec that implements the org.apache.hadoop.io.compress.CompressionCodec
interface. This codec applies to all data files.
Several standard codecs are available in Hadoop, including the following:
Default value: None
A comma-separated list of fully qualified HDFS paths. This property enables you to restrict the input by using special pattern-matching characters in the path specification. See Table 2-1. This property is required for the -createTable
and -publish
commands using Data Pump or delimited text files. The property is ignored for Hive data sources.
For example, to select all files in /data/s2/
, and only the CSV files in /data/s7/
, /data/s8/
, and /data/s9/
, enter this expression:
/data/s2/,/data/s[7-9]/*.csv
The external table accesses the data contained in all listed files and all files in listed directories. These files compose a single data set.
The data set can contain compressed files or uncompressed files, but not both.
Table 2-1 Pattern-Matching Characters
Character | Description |
---|---|
? | Matches any single character |
* | Matches zero or more characters |
[abc] | Matches a single character from the character set {a, b, c} |
[a-b] | Matches a single character from the character range {a...b}. The character a must be less than or equal to b. |
[^a] | Matches a single character that is not from the character set or range {a}. The carat (^) must immediately follow the left bracket. |
\c | Removes any special meaning of character c. The backslash is the escape character. |
{ab,cd} | Matches a string from the string set {ab, cd}. Precede the comma with an escape character (\) to remove the meaning of the comma as a path separator. |
{ab,c{de,fh}} | Matches a string from the string set {ab, cde, cfh}. Precede the comma with an escape character (\) to remove the meaning of the comma as a path separator. |
The path filter class. This property is ignored for Hive data sources.
Oracle SQL Connector for HDFS uses a default filter to exclude hidden files, which begin with a dot or an underscore. If you specify another path filter class using this property, then your filter acts in addition to the default filter. Thus, only visible files accepted by your filter are considered.
Specifies the default directory for the Oracle external table. This directory is used for all input and output files that do not explicitly name a directory object.
Valid value: The name of an existing database directory
Unquoted names are changed to uppercase. Double-quoted names are not changed; use them when case sensitivity is desired. Single-quoted names are not allowed for default directory names.
The -createTable
command requires this property.
Specifies the field terminator for an external table when oracle.hadoop.exttab.sourceType=text
. Optional.
Default value: , (comma)
Valid values: A string in one of the following formats:
One or more regular printable characters; it cannot start with \u.
For example, \t represents a tab.
One or more encoded characters in the format \uHHHH, where HHHH is a big-endian hexadecimal representation of the character in UTF-16. For example, \u0009 represents a tab. The hexadecimal digits are case insensitive.
Do not mix the two formats.
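The mapping from a \uHHHH value to the actual terminator character can be sketched as follows. This is an illustrative helper mirroring the documented encoding, not code from the connector:

```python
import re

# Decode a terminator value as documented: either literal printable
# characters, or one or more \uHHHH UTF-16 code units (never mixed).
def decode_terminator(value):
    if value.startswith("\\u"):
        hex_groups = re.findall(r"\\u([0-9a-fA-F]{4})", value)
        return "".join(chr(int(h, 16)) for h in hex_groups)
    return value  # regular printable characters are used as-is

print(decode_terminator("\\u0009") == "\t")  # True: \u0009 is a tab
print(decode_terminator(",") == ",")         # True: literal comma
print(decode_terminator("\\u002c") == ",")   # True: hex digits are case insensitive
```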
The name of a Hive database that contains the input data table.
The -createTable
command requires this property when oracle.hadoop.exttab.sourceType=hive
.
The name of an existing Hive table.
The -createTable
command requires this property when oracle.hadoop.exttab.sourceType=hive
.
Specifies the initial field encloser for an external table created from delimited text files. Optional.
Default value: null; no enclosers are specified for the external table definition.
The -createTable
command uses this property when oracle.hadoop.exttab.sourceType=text
.
Valid values: A string in one of the following formats:
One or more regular printable characters; it cannot start with \u.
One or more encoded characters in the format \uHHHH, where HHHH is a big-endian hexadecimal representation of the character in UTF-16. The hexadecimal digits are case insensitive.
Do not mix the two formats.
Specifies the desired number of location files for the external table. Applicable only to non-Data-Pump files.
Default value: 4
This property is ignored if the data files are in Data Pump format. Otherwise, the number of location files is the lesser of:
The number of data files
The value of this property
At least one location file is created.
See "Enabling Parallel Processing" for more information about the number of location files.
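The documented rule for non-Data-Pump sources reduces to a simple calculation, sketched here as a hypothetical helper (not connector code):

```python
# Number of location files actually created: the lesser of the
# data-file count and oracle.hadoop.exttab.locationFileCount
# (default 4), but always at least one.
def location_files(data_file_count, location_file_count=4):
    return max(1, min(data_file_count, location_file_count))

print(location_files(10))  # 4: capped by the default property value
print(location_files(2))   # 2: fewer data files than the property
print(location_files(0))   # 1: at least one location file is created
```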
Specifies a database directory where log files, bad files, and discard files are stored. The file names are the default values used by external tables. For example, the name of a log file is the table name followed by _%p.log.
This is an optional property for the -createTable
command.
These are the default file name extensions:
Log files: log
Bad files: bad
Discard files: dsc
Valid values: An existing Oracle directory object name.
Unquoted names are uppercased. Quoted names are not changed. Table 2-2 provides examples of how values are transformed.
Specifies the database directory for the preprocessor. The file-system directory must contain the hdfs_stream script.
Default value: OSCH_BIN_PATH
The preprocessor directory is used in the PREPROCESSOR
clause of the external table.
Specifies the record delimiter for an external table created from delimited text files. Optional.
Default value: \n
The -createTable
command uses this parameter when oracle.hadoop.exttab.sourceType=text
.
Valid values: A string in one of the following formats:
One or more regular printable characters; it cannot start with \u.
One or more encoded characters in the format \uHHHH, where HHHH is a big-endian hexadecimal representation of the character in UTF-16. The hexadecimal digits are case insensitive.
Do not mix the two formats.
Specifies the type of source files.
The valid values are datapump
, hive
, and text
.
Default value: text
The -createTable
and -publish
operations require this property.
Schema-qualified name of the external table in this format:
schemaName.tableName
If you omit schemaName, then the schema name defaults to the connection user name.
Default value: none
Required property for all operations.
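The schema-name default can be expressed as a one-line rule. The helper below is hypothetical and only illustrates the documented fallback to the connection user name:

```python
# If the table name is already schema-qualified, use it as-is;
# otherwise prefix it with the connection user name.
def resolve_table_name(table_name, connection_user):
    if "." in table_name:
        return table_name
    return connection_user + "." + table_name

print(resolve_table_name("SH.SALES_EXT_DIR", "SCOTT"))  # SH.SALES_EXT_DIR
print(resolve_table_name("SALES_EXT_DIR", "SH"))        # SH.SALES_EXT_DIR
```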
Specifies the trailing field encloser for an external table created from delimited text files. Optional.
Default value: null; defaults to the value of oracle.hadoop.exttab.initialFieldEncloser
The -createTable
command uses this property when oracle.hadoop.exttab.sourceType=text
.
Valid values: A string in one of the following formats:
One or more regular printable characters; it cannot start with \u.
One or more encoded characters in the format \uHHHH, where HHHH is a big-endian hexadecimal representation of the character in UTF-16. The hexadecimal digits are case insensitive.
Do not mix the two formats.
Specifies the database connection string in the thin-style service name format:
jdbc:oracle:thin:@//host_name:port/service_name
If you are unsure of the service name, then enter this SQL command as a privileged user:
SQL> show parameter service
If an Oracle wallet is configured as an external password store, then the property value must start with the driver prefix jdbc:oracle:thin:@
and db_connect_string
must exactly match the credentials defined in the wallet.
This property takes precedence over all other connection properties.
Default value: Not defined
Valid values: A string
An Oracle database log-in name. The ExternalTable
tool prompts for a password. This property is required unless you are using Oracle wallet as an external password store.
Default value: Not defined
Valid values: A string
Specifies a TNS entry name defined in the tnsnames.ora file.
This property is used with the oracle.hadoop.connection.tns_admin
property.
Default value: Not defined
Valid values: A string
Specifies the directory that contains the tnsnames.ora file. Define this property to use Transparent Network Substrate (TNS) entry names in database connection strings. When using TNS names with the JDBC thin driver, you must set either this property or the Java oracle.net.tns_admin
system property. When both are set, this property takes precedence over oracle.net.tns_admin
.
This property must be set when using Oracle Wallet as an external password store. See oracle.hadoop.connection.wallet_location.
Default value: The value of the Java oracle.net.tns_admin
system property
Valid values: A string
A file path to an Oracle wallet directory where the connection credential is stored.
Default value: Not defined
Valid values: A string
When using Oracle Wallet as an external password store, set these properties:
oracle.hadoop.connection.wallet_location
oracle.hadoop.connection.url
or oracle.hadoop.connection.tnsEntryName
oracle.hadoop.connection.tns_admin
Parallel processing is extremely important when you are working with large volumes of data. When you use external tables, always enable parallel query with this SQL command:
ALTER SESSION ENABLE PARALLEL QUERY;
Before loading the data into an Oracle database from the external files created by Oracle SQL Connector for HDFS, enable parallel DDL:
ALTER SESSION ENABLE PARALLEL DDL;
Before inserting data into an existing database table, enable parallel DML with this SQL command:
ALTER SESSION ENABLE PARALLEL DML;
Hints such as APPEND
and PQ_DISTRIBUTE
also improve performance when you are inserting data.