2 Oracle SQL Connector for Hadoop Distributed File System

This chapter describes how to use Oracle SQL Connector for Hadoop Distributed File System (HDFS) to facilitate data access between Hadoop and Oracle Database.

This chapter contains the following sections:

About Oracle SQL Connector for HDFS
Getting Started With Oracle SQL Connector for HDFS
Configuring Your System for Oracle SQL Connector for HDFS
Using the ExternalTable Command-Line Tool
Creating External Tables
Publishing the HDFS Data Paths
Listing Location File Metadata and Contents
Describing External Tables
More About External Tables Generated by the ExternalTable Tool
Configuring Oracle SQL Connector for HDFS
Performance Tips for Querying Data in HDFS

2.1 About Oracle SQL Connector for HDFS

Using Oracle SQL Connector for HDFS, you can use Oracle Database to access and analyze data residing in Hadoop in these formats:

Data Pump files in HDFS
Delimited text files in HDFS
Hive tables

For other file formats, such as JSON files, you can stage the input in Hive tables before using Oracle SQL Connector for HDFS.

Oracle SQL Connector for HDFS uses external tables to provide Oracle Database with read access to Hive tables, and to delimited text files and Data Pump files in HDFS. An external table is an Oracle Database object that identifies the location of data outside of a database. Oracle Database accesses the data by using the metadata provided when the external table was created. By querying the external tables, you can access data stored in HDFS and Hive tables as if that data were stored in tables in an Oracle database.

To create an external table for this purpose, you use the ExternalTable command-line tool provided with Oracle SQL Connector for HDFS. You provide ExternalTable with information about the data source in Hadoop and about your schema in an Oracle Database. You provide this information either as parameters to the ExternalTable command or in an XML file.

When the external table is ready, you can query the data the same as any other database table. You can query and join data in HDFS or a Hive table with other database-resident data.

You can also perform bulk loads of data into Oracle database tables using SQL. You may prefer that the data resides in an Oracle database—all of it or just a selection—if it is queried routinely.

2.2 Getting Started With Oracle SQL Connector for HDFS

The following list identifies the basic steps that you take when using Oracle SQL Connector for HDFS.

The first time you use Oracle SQL Connector for HDFS, ensure that the software is installed and configured.

See "Configuring Your System for Oracle SQL Connector for HDFS."
Log in to the appropriate system, either the Oracle Database system or a node in the Hadoop cluster.

See "Configuring Your System for Oracle SQL Connector for HDFS."
Create an XML document describing the connections and the data source, unless you are providing these parameters in the ExternalTable command.

See "Describing External Tables."
Create a shell script containing an ExternalTable command.

See "Using the ExternalTable Command-Line Tool."
Run the shell script.
If the job fails, then use the diagnostic messages in the output to identify and correct the error. Depending on how far the job progressed before failing, you may need to delete the table definition from the Oracle database before rerunning the script.
After the job succeeds, connect to Oracle Database as the owner of the external table. Query the table to ensure that the data is accessible.
If the data will be queried frequently, then you may want to load it into a database table. External tables do not have indexes or partitions.

Example 2-1 illustrates these steps.

Example 2-1 Accessing HDFS Data Files from Oracle Database

$ cat moviefact_hdfs.sh
# Add environment variables
export OSCH_HOME="/opt/oracle/orahdfs-2.2.0"
 
hadoop jar $OSCH_HOME/jlib/orahdfs.jar \
       oracle.hadoop.exttab.ExternalTable \
       -conf /home/jdoe/movie/moviefact_hdfs.xml \
       -createTable

$ cat moviefact_hdfs.xml
<?xml version="1.0"?>
 <configuration>
    <property>
      <name>oracle.hadoop.exttab.tableName</name>
      <value>MOVIE_FACT_EXT_TAB_TXT</value>
    </property>
    <property>
      <name>oracle.hadoop.exttab.locationFileCount</name>
      <value>4</value>
    </property>
    <property>
      <name>oracle.hadoop.exttab.dataPaths</name>
      <value>/user/jdoe/moviework/data/part*</value>
    </property>
    <property>
      <name>oracle.hadoop.exttab.fieldTerminator</name>
      <value>\u0009</value>
    </property>
    <property>
      <name>oracle.hadoop.exttab.defaultDirectory</name>
      <value>MOVIE_DIR</value>
    </property>
    <property>
      <name>oracle.hadoop.exttab.columnNames</name>
      <value>CUST_ID,MOVIE_ID,GENRE_ID,TIME_ID,RECOMMENDED,ACTIVITY_ID,RATING,SALES</value>
    </property>
    <property>
      <name>oracle.hadoop.exttab.sourceType</name>
      <value>text</value>
    </property>
    <property>
      <name>oracle.hadoop.connection.url</name>
      <value>jdbc:oracle:thin:@//dbhost:1521/orcl.example.com</value>
    </property>
    <property>
      <name>oracle.hadoop.connection.user</name>
      <value>MOVIEDEMO</value>
    </property>
</configuration>

$ sh moviefact_hdfs.sh
Oracle SQL Connector for HDFS Release 2.2.0 - Production
 
Copyright (c) 2012, Oracle and/or its affiliates. All rights reserved.
 
Enter Database Password: password]
The create table command succeeded.
 
CREATE TABLE "MOVIEDEMO"."MOVIE_FACT_EXT_TAB_TXT"
(
  "CUST_ID"                        VARCHAR2(4000),
 "MOVIE_ID"                       VARCHAR2(4000),
 "GENRE_ID"                       VARCHAR2(4000),
 "TIME_ID"                        VARCHAR2(4000),
 "RECOMMENDED"                    VARCHAR2(4000),
 "ACTIVITY_ID"                    VARCHAR2(4000),
 "RATING"                        VARCHAR2(4000),
 "SALES"                          VARCHAR2(4000)
)
ORGANIZATION EXTERNAL
(
   TYPE ORACLE_LOADER
   DEFAULT DIRECTORY "MOVIE_DIR"
   ACCESS PARAMETERS
   (
     RECORDS DELIMITED BY 0X'0A'
     disable_directory_link_check
     CHARACTERSET AL32UTF8
     STRING SIZES ARE IN CHARACTERS
     PREPROCESSOR "OSCH_BIN_PATH":'hdfs_stream'
     FIELDS TERMINATED BY 0X'2C'
     MISSING FIELD VALUES ARE NULL
     (
       "CUST_ID" CHAR,
       "MOVIE_ID" CHAR,
       "GENRE_ID" CHAR,
       "TIME_ID" CHAR,
       "RECOMMENDED" CHAR,
       "ACTIVITY_ID" CHAR,
       "RATINGS" CHAR,
       "SALES" CHAR
     )
   )
   LOCATION
   (
     'osch-20130314092801-1513-1',
     'osch-20130314092801-1513-2',
     'osch-20130314092801-1513-3',
     'osch-20130314092801-1513-4'
   )
) PARALLEL REJECT LIMIT UNLIMITED;
 
The following location files were created.
 
osch-20130314092801-1513-1 contains 1 URI, 12754882 bytes
 
    12754882 hdfs://bda-ns/user/jdoe/moviework/data/part-00001
 
osch-20130314092801-1513-2 contains 1 URI, 438 bytes
 
         438 hdfs://bda-ns/user/jdoe/moviework/data/part-00002
 
osch-20130314092801-1513-3 contains 1 URI, 432 bytes
 
         432 hdfs://bda-ns/user/jdoe/moviework/data/part-00003
 
osch-20130314092801-1513-4 contains 1 URI, 202 bytes
 
         202 hdfs://bda-ns/user/jdoe/moviework/data/part-00004

SQL*Plus: Release 11.2.0.3.0 Production on Thu Mar 14 14:14:31 2013
 
Copyright (c) 1982, 2011, Oracle.  All rights reserved.
 
Enter password: password
 
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, OLAP and Data Mining options

SQL> SELECT cust_id, movie_id, time_id FROM movie_fact_ext_tab_txt
  2  WHERE rownum < 5;

CUST_ID              MOVIE_ID             TIME_ID
-------------------- -------------------- --------------------
1150211              585                  01-JAN-11
1221463              9870                 01-JAN-11
1002672              1422                 01-JAN-11
1095718              544                  01-JAN-11

SQL> DESCRIBE movie_fact_ext_tab_txt
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 CUST_ID                                            VARCHAR2(4000)
 MOVIE_ID                                           VARCHAR2(4000)
 GENRE_ID                                           VARCHAR2(4000)
 TIME_ID                                            VARCHAR2(4000)
 RECOMMENDED                                        VARCHAR2(4000)
 ACTIVITY_ID                                        VARCHAR2(4000)
 RATINGS                                            VARCHAR2(4000)
 SALES                                              VARCHAR2(4000)
 
SQL> CREATE TABLE movie_facts AS
   2 SELECT CAST(cust_id AS VARCHAR2(12)) cust_id, 
   3      CAST(movie_id AS VARCHAR2(12)) movie_id,
   4      CAST(genre_id AS VARCHAR(3)) genre_id,
   5      TO_TIMESTAMP(time_id,'YYYY-MM-DD-HH24:MI:SS:FF') time,
   6      TO_NUMBER(recommended) recommended,
   7      TO_NUMBER(activity_id) activity_id, 
   8      TO_NUMBER(ratings) ratings, 
   9      TO_NUMBER(sales) sales
   10 FROM movie_fact_ext_tab_txt;

SQL> DESCRIBE movie_facts
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 CUST_ID                                            VARCHAR2(12)
 MOVIE_ID                                           VARCHAR2(12)
 GENRE_ID                                           VARCHAR2(3)
 TIME                                               TIMESTAMP(9)
 RECOMMENDED                                        NUMBER
 ACTIVITY                                           NUMBER
 RATING                                             NUMBER
 SALES                                              NUMBER

2.3 Configuring Your System for Oracle SQL Connector for HDFS

You can run Oracle SQL Connector for HDFS on either the Oracle Database system or the Hadoop cluster:

For Hive sources, you must log in to a node in the Hadoop cluster.
For text and Data Pump format files, you can log in to either the Oracle Database system or a node in the Hadoop cluster.

Oracle SQL Connector for HDFS requires additions to the HADOOP_CLASSPATH environment variable on the system where you log in. Your system administrator may have set them up for you when creating your account, or may have left that task for you. See "Setting Up User Accounts on the Oracle Database System" and "Setting Up User Accounts on the Hadoop Cluster".

Setting up the environment variables:

Verify that HADOOP_CLASSPATH includes the path to the JAR files for Oracle SQL Connector for HDFS:
```
path/orahdfs-2.2.0/jlib/*
```
If you are logged in to a Hadoop cluster with Hive data sources, then verify that HADOOP_CLASSPATH also includes the Hive JAR files and conf directory. For example:
```
/usr/lib/hive/lib/*
/etc/hive/conf
```
For your convenience, you can create an OSCH_HOME environment variable. The following is the Bash command for setting it on Oracle Big Data Appliance:
```
$ export OSCH_HOME="/opt/oracle/orahdfs-2.2.0"
```

See Also:

"Oracle SQL Connector for Hadoop Distributed File System Setup" for instructions for installing the software and setting up user accounts on both systems.

OSCH_HOME/doc/README.txt for information about known problems with Oracle SQL Connector for HDFS.

2.4 Using the ExternalTable Command-Line Tool

Oracle SQL Connector for HDFS provides a command-line tool named ExternalTable. This section describes the basic use of this tool. See "Creating External Tables" for the command syntax that is specific to your data source format.

2.4.1 About ExternalTable

The ExternalTable tool uses the values of several properties to do the following tasks:

Create an external table
Populate the location files
Publish location files to an existing external table
List the location files
Describe an external table

You can specify these property values in an XML document or individually on the command line. See "Configuring Oracle SQL Connector for HDFS.".

2.4.2 ExternalTable Command-Line Tool Syntax

This is the full syntax of the ExternalTable command-line tool:

hadoop jar OSCH_HOME/jlib/orahdfs.jar \
oracle.hadoop.exttab.ExternalTable \
[-conf config_file]... \ 
[-D property=value]... \
-createTable [--noexecute] 
  | -publish [--noexecute] 
  | -listlocations [--details]
  | -getDDL

You can either create the OSCH_HOME environment variable or replace OSCH_HOME in the command syntax with the full path to the installation directory for Oracle SQL Connector for HDFS. On Oracle Big Data Appliance, this directory is:

/opt/oracle/orahdfs-version

For example, you might run the ExternalTable command-line tool with a command like this:

hadoop jar /opt/oracle/orahdfs-2.2.0/jlib/orahdfs.jar \
oracle.hadoop.exttab.ExternalTable \
     .
     .
     .

Parameter Descriptions

-conf config_file

Identifies the name of an XML configuration file containing properties needed by the command being executed. See "Configuring Oracle SQL Connector for HDFS."

-D property=value

Assigns a value to a specific property.

-createTable [--noexecute]

Creates an external table definition and publishes the data URIs to the location files of the external table. The output report shows the DDL used to create the external table and lists the contents of the location files.

Use the --noexecute option to see the execution plan of the command. The operation is not executed, but the report includes the details of the execution plan and any errors. Oracle recommends that you first execute a -createTable command with --noexecute.

-publish [--noexecute]

Publishes the data URIs to the location files of an existing external table. Use this option after adding new data files, so that the existing external table can access them.

Use the --noexecute option to see the execution plan of the command. The operation is not executed, but the report shows the planned SQL ALTER TABLE command and location files. The report also shows any errors. Oracle recommends that you first execute a -publish command with --noexecute.

See "Publishing the HDFS Data Paths."

-listLocations [--details]

Shows the location file content as text. With the --details option, this command provides a detailed listing. See "What Are Location Files?."

-getDDL

Prints the table definition of an existing external table. See "Describing External Tables."

See Also:

"Syntax Conventions"

2.5 Creating External Tables

You can create external tables automatically using the ExternalTable tool provided in Oracle SQL Connector for HDFS.

2.5.1 Creating External Tables with the ExternalTable Tool

To create an external table using the ExternalTable tool, follow the instructions for your data source:

Creating External Tables from Data Pump Format Files
Creating External Tables from Hive Tables
Creating External Tables from Delimited Text Files

When the ExternalTable -createTable command finishes executing, the external table is ready for use.

To create external tables manually, follow the instructions in "Creating External Tables in SQL."

ExternalTable Syntax for -createTable

Use the following syntax to create an external table and populate its location files:

hadoop jar OSCH_HOME/jlib/orahdfs.jar oracle.hadoop.exttab.ExternalTable \
[-conf config_file]... \
[-D property=value]... \
-createTable [--noexecute]

2.5.2 Creating External Tables from Data Pump Format Files

Oracle SQL Connector for HDFS supports only Data Pump files produced by Oracle Loader for Hadoop, and does not support generic Data Pump files produced by Oracle Utilities.

Oracle SQL Connector for HDFS creates the external table definition for Data Pump files by using the metadata from the Data Pump file header. It uses the ORACLE_LOADER access driver with the preprocessor access parameter. It also uses a special access parameter named EXTERNAL VARIABLE DATA, which enables ORACLE_LOADER to read the Data Pump format files generated by Oracle Loader for Hadoop.

Note:

Oracle SQL Connector for HDFS requires a patch to Oracle Database 11.2.0.2 before the connector can access Data Pump files produced by Oracle Loader for Hadoop. To download this patch, go to http://support.oracle.com and search for bug 14557588.

Release 11.2.0.3 and later releases do not require this patch.

2.5.2.1 Required Properties

These properties are required:

oracle.hadoop.exttab.tableName
oracle.hadoop.exttab.defaultDirectory
oracle.hadoop.exttab.dataPaths
oracle.hadoop.exttab.sourceType=datapump
oracle.hadoop.connection.url
oracle.hadoop.connection.user

See "Configuring Oracle SQL Connector for HDFS" for descriptions of the properties used for this data source.

2.5.2.2 Optional Properties

This property is optional:

oracle.hadoop.exttab.logDirectory

2.5.2.3 Defining Properties in XML Files for Data Pump Format Files

Example 2-2 is an XML template containing all the properties that can be used to describe a Data Pump file. To use the template, cut and paste it into a text file, enter the appropriate values to describe your Data Pump file, and delete any optional properties that you do not need. For more information about using XML templates, see "Creating a Configuration File."

Example 2-2 XML Template with Properties for a Data Pump Format File

<?xml version="1.0"?>

<!-- Required Properties -->

<configuration>
  <property>
    <name>oracle.hadoop.exttab.tableName</name>
    <value>value</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.defaultDirectory</name>
    <value>value</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.dataPaths</name>
    <value>value</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.sourceType</name>
    <value>datapump</value>
  </property>
  <property>
    <name>oracle.hadoop.connection.url</name>
    <value>value</value>
  </property>
  <property>
    <name>oracle.hadoop.connection.user</name>
    <value>value</value>
  </property>

<!-- Optional Properties -->

  <property>
    <name>oracle.hadoop.exttab.logDirectory</name>
    <value>value</value>
  </property>
</configuration>

2.5.2.4 Example

Example 2-3 creates an external table named SALES_DP_XTAB to read Data Pump files.

Example 2-3 Defining an External Table for Data Pump Format Files

Log in as the operating system user that Oracle Database runs under (typically the oracle user), and create a file-system directory:

$ mkdir /scratch/sales_dp_dir

Create a database directory and grant read and write access to it:

$ sqlplus / as sysdba
SQL> CREATE OR REPLACE DIRECTORY sales_dp_dir AS '/scratch/sales_dp_dir'
SQL> GRANT READ, WRITE ON DIRECTORY sales_dp_dir TO scott;

Create the external table:

hadoop jar OSCH_HOME/jlib/orahdfs.jar \
oracle.hadoop.exttab.ExternalTable \
-D oracle.hadoop.exttab.tableName=SALES_DP_XTAB \
-D oracle.hadoop.exttab.sourceType=datapump \
-D oracle.hadoop.exttab.dataPaths=hdfs:///user/scott/olh_sales_dpoutput/ \
-D oracle.hadoop.exttab.defaultDirectory=SALES_DP_DIR \
-D oracle.hadoop.connection.url=jdbc:oracle:thin:@//myhost:1521/myservicename \
-D oracle.hadoop.connection.user=SCOTT \
-createTable

2.5.3 Creating External Tables from Hive Tables

Oracle SQL Connector for HDFS creates the external table definition from a Hive table by contacting the Hive metastore client to retrieve information about the table columns and the location of the table data. In addition, the Hive table data paths are published to the location files of the Oracle external table.

To read Hive table metadata, Oracle SQL Connector for HDFS requires that the Hive JAR files are included in the HADOOP_CLASSPATH variable. This means that Oracle SQL Connector for HDFS must be installed and running on a computer with a working Hive client.

Ensure that you add the Hive configuration directory to the HADOOP_CLASSPATH environment variable. You must have a correctly functioning Hive client.

For Hive managed tables, the data paths come from the warehouse directory.

For Hive external tables, the data paths from an external location in HDFS are published to the location files of the Oracle external table. Hive external tables can have no data, because Hive does not check if the external location is defined when the table is created. If the Hive table is empty, then one location file is published with just a header and no data URIs.

The Oracle external table is not a "live" Hive table. When changes are made to a Hive table, you must use the ExternalTable tool to either republish the data or create a new external table.

2.5.3.1 Hive Table Requirements

Oracle SQL Connector for HDFS supports non-partitioned Hive tables that are defined using ROW FORMAT DELIMITED and FILE FORMAT TEXTFILE clauses. Both Hive-managed tables and Hive external tables are supported.

Hive tables can be either bucketed or not bucketed. Table columns with all primitive types from Hive 0.7.1 (CDH3), and the DECIMAL and TIMESTAMP types are supported.

2.5.3.2 Required Properties

These properties are required for Hive table sources:

oracle.hadoop.exttab.tableName
oracle.hadoop.exttab.defaultDirectory
oracle.hadoop.exttab.sourceType=hive
oracle.hadoop.exttab.hive.tableName
oracle.hadoop.exttab.hive.databaseName
oracle.hadoop.connection.url
oracle.hadoop.connection.user

See "Configuring Oracle SQL Connector for HDFS" for descriptions of the properties used for this data source.

2.5.3.3 Optional Properties

This property is optional for Hive table sources:

oracle.hadoop.exttab.locationFileCount

2.5.3.4 Defining Properties in XML Files for Hive Tables

Example 2-4 is an XML template containing all the properties that can be used to describe a Hive table. To use the template, cut and paste it into a text file, enter the appropriate values to describe your Hive table, and delete any optional properties that you do not need. For more information about using XML templates, see "Creating a Configuration File."

Example 2-4 XML Template with Properties for a Hive Table

<?xml version="1.0"?>

<!-- Required Properties -->

<configuration>
  <property>
    <name>oracle.hadoop.exttab.tableName</name>
    <value>value</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.defaultDirectory</name>
    <value>value</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.sourceType</name>
    <value>hive</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.hive.tableName</name>
    <value>value</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.hive.databaseName</name>
    <value>value</value>
  </property>
  <property>
    <name>oracle.hadoop.connection.url</name>
    <value>value</value>
  </property>
  <property>
    <name>oracle.hadoop.connection.user</name>
    <value>value</value>
  </property>

<!-- Optional Properties -->

  <property>
    <name>oracle.hadoop.exttab.locationFileCount</name>
    <value>value</value>
  </property>
</configuration>

2.5.3.5 Example

Example 2-5 creates an external table named SALES_HIVE_XTAB to read data from a Hive table. The example defines all the properties on the command line instead of in an XML file.

Example 2-5 Defining an External Table for a Hive Table

Log in as the operating system user that Oracle Database runs under (typically the oracle user), and create a file-system directory:

$ mkdir /scratch/sales_hive_dir

Create a database directory and grant read and write access to it:

$ sqlplus / as sysdba
SQL> CREATE OR REPLACE DIRECTORY sales_hive_dir AS '/scratch/sales_hive_dir'
SQL> GRANT READ, WRITE ON DIRECTORY sales_hive_dir TO scott;

Create the external table:

hadoop jar OSCH_HOME/jlib/orahdfs.jar \
oracle.hadoop.exttab.ExternalTable \
-D oracle.hadoop.exttab.tableName=SALES_HIVE_XTAB \
-D oracle.hadoop.exttab.sourceType=hive \
-D oracle.hadoop.exttab.locationFileCount=2 \
-D oracle.hadoop.exttab.hive.tableName=sales_country_us \
-D oracle.hadoop.exttab.hive.databaseName=salesdb \
-D oracle.hadoop.exttab.defaultDirectory=SALES_HIVE_DIR \
-D oracle.hadoop.connection.url=jdbc:oracle:thin:@//myhost:1521/myservicename \
-D oracle.hadoop.connection.user=SCOTT \
-createTable

2.5.4 Creating External Tables from Delimited Text Files

Oracle SQL Connector for HDFS creates the external table definition for delimited text files using configuration properties that specify the number of columns, the text delimiter, and optionally, the external table column names. All text columns in the external table are VARCHAR2. If column names are not provided, they default to C1 to Cn, where n is the number of columns specified by the oracle.hadoop.exttab.columnCount property.

2.5.4.1 Required Properties

These properties are required for delimited text sources:

oracle.hadoop.exttab.tableName
oracle.hadoop.exttab.defaultDirectory
oracle.hadoop.exttab.dataPaths
oracle.hadoop.exttab.columnCount or oracle.hadoop.exttab.columnNames
oracle.hadoop.connection.url
oracle.hadoop.connection.user

See "Configuring Oracle SQL Connector for HDFS" for descriptions of the properties used for this data source.

2.5.4.2 Optional Properties

These properties are optional for delimited text sources:

oracle.hadoop.exttab.recordDelimiter
oracle.hadoop.exttab.fieldTerminator
oracle.hadoop.exttab.initialFieldEncloser
oracle.hadoop.exttab.trailingFieldEncloser
oracle.hadoop.exttab.locationFileCount

2.5.4.3 Defining Properties in XML Files for Delimited Text Files

Example 2-6 is an XML template containing all the properties that can be used to describe a delimited text file. To use the template, cut and paste it into a text file, enter the appropriate values to describe your data files, and delete any optional properties that you do not need. For more information about using XML templates, see "Creating a Configuration File."

Example 2-6 XML Template with Properties for a Delimited Text File

<?xml version="1.0"?>

<!-- Required Properties -->

<configuration>
  <property>
    <name>oracle.hadoop.exttab.tableName</name>
    <value>value</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.defaultDirectory</name>
    <value>value</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.dataPaths</name>
    <value>value</value>
  </property>

<!-- Use either columnCount or columnNames -->

  <property>
    <name>oracle.hadoop.exttab.columnCount</name>
    <value>value</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.columnNames</name>
    <value>value</value>
  </property>

  <property>
    <name>oracle.hadoop.connection.url</name>
    <value>value</value>
  </property>
  <property>
    <name>oracle.hadoop.connection.user</name>
    <value>value</value>
  </property>

<!-- Optional Properties -->

  <property>
    <name>oracle.hadoop.exttab.recordDelimiter</name>
    <value>value</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.fieldTerminator</name>
    <value>value</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.initialFieldEncloser</name>
    <value>value</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.trailingFieldEncloser</name>
    <value>value</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.locationFileCount</name>
    <value>value</value>
  </property>
</configuration>

2.5.4.4 Example

Example 2-7 creates an external table named SALES_DT_XTAB from delimited text files.

Example 2-7 Defining an External Table for Delimited Text Files

Log in as the operating system user that Oracle Database runs under (typically the oracle user), and create a file-system directory:

$ mkdir /scratch/sales_dt_dir

Create a database directory and grant read and write access to it:

$ sqlplus / as sysdba
SQL> CREATE OR REPLACE DIRECTORY sales_dt_dir AS '/scratch/sales_dt_dir'
SQL> GRANT READ, WRITE ON DIRECTORY sales_dt_dir TO scott;

Create the external table:

hadoop jar OSCH_HOME/jlib/orahdfs.jar \
oracle.hadoop.exttab.ExternalTable \
-D oracle.hadoop.exttab.tableName=SALES_DT_XTAB \
-D oracle.hadoop.exttab.locationFileCount=2 \
-D oracle.hadoop.exttab.dataPaths="hdfs:///user/scott/olh_sales/*.dat" \
-D oracle.hadoop.exttab.columnCount=10 \
-D oracle.hadoop.exttab.defaultDirectory=SALES_DT_DIR \
-D oracle.hadoop.connection.url=jdbc:oracle:thin:@//myhost:1521/myservicename \
-D oracle.hadoop.connection.user=SCOTT \
-createTable

2.5.5 Creating External Tables in SQL

You can create an external table manually for Oracle SQL Connector for HDFS. For example, the following procedure enables you to use external table syntax that is not exposed by the ExternalTable -createTable command.

Additional syntax might not be supported for Data Pump format files.

To create an external table manually:

Use the -createTable --noexecute command to generate the external table DDL.
Make whatever changes are needed to the DDL.
Run the DDL from Step 2 to create the table definition in the Oracle database.
Use the ExternalTable -publish command to publish the data URIs to the location files of the external table.

2.6 Publishing the HDFS Data Paths

The -createTable command creates the metadata in Oracle Database and populates the location files with the Universal Resource Identifiers (URIs) of the data files in HDFS. However, you might publish the URIs as a separate step from creating the external table in cases like these:

You want to publish new data into an already existing external table.
You created the external table manually instead of using the ExternalTable tool.

In both cases, you can use ExternalTable with the -publish command to populate the external table location files with the URIs of the data files in HDFS. See "Location File Management".

ExternalTable Syntax for Publish

hadoop jar OSCH_HOME/jlib/orahdfs.jar \
oracle.hadoop.exttab.ExternalTable \
[-conf config_file]... \
[-D property=value]... \
-publish [--noexecute]

ExternalTable Command-Line Tool Example

Example 2-8 sets HADOOP_CLASSPATH and publishes the HDFS data paths to the external table created in Example 2-3. See "Configuring Your System for Oracle SQL Connector for HDFS" for more information about setting this environment variable.

Example 2-8 Publishing HDFS Data Paths to an External Table for Data Pump Format Files

This example uses the Bash shell.

$ export HADOOP_CLASSPATH="OSCH_HOME/jlib/*"
$ hadoop jar OSCH_HOME/jlib/orahdfs.jar oracle.hadoop.exttab.ExternalTable \
-D oracle.hadoop.exttab.tableName=SALES_DP_XTAB \
-D oracle.hadoop.exttab.sourceType=datapump \
-D oracle.hadoop.exttab.dataPaths=hdfs:/user/scott/data/ \
-D oracle.hadoop.connection.url=jdbc:oracle:thin:@//myhost:1521/myservicename \
-D oracle.hadoop.exttab.connection.user=scott -publish

In this example:

OSCH_HOME is the full path to the Oracle SQL Connector for HDFS installation directory.
SALES_DP_XTAB is the external table created in Example 2-3.
hdfs:/user/scott/data/ is the location of the HDFS data.
@myhost:1521/orcl is the database connection string.

2.7 Listing Location File Metadata and Contents

The -listLocations command is a debugging and diagnostic utility that enables you to see the location file metadata and contents. You can use this command to verify the integrity of the location files of an Oracle external table.

These properties are required to use this command:

oracle.hadoop.exttab.tableName
The JDBC connection properties; see "Connection Properties."

ExternalTable Syntax for -listLocations

hadoop jar OSCH_HOME/jlib/orahdfs.jar \
oracle.hadoop.exttab.ExternalTable \
[-conf config_file]... \ 
[-D property=value]... \
-listLocations [--details]

2.8 Describing External Tables

The -getDDL command is a debugging and diagnostic utility that prints the definition of an existing external table. This command follows the security model of the PL/SQL DBMS_METADATA package, which enables non-privileged users to see the metadata for their own objects.

These properties are required to use this command:

oracle.hadoop.exttab.tableName
The JDBC connection properties; see "Connection Properties."

ExternalTable Syntax for -getDDL

hadoop jar OSCH_HOME/jlib/orahdfs.jar \
oracle.hadoop.exttab.ExternalTable \
[-conf config_file]... \
[-D property=value]... \
-getDDL

2.9 More About External Tables Generated by the ExternalTable Tool

Because external tables are used to access data, all of the features and limitations of external tables apply. Queries are executed in parallel with automatic load balancing. However, update, insert, and delete operations are not allowed and indexes cannot be created on external tables. When an external table is accessed, a full table scan is always performed.

Oracle SQL Connector for HDFS uses the ORACLE_LOADER access driver. The hdfs_stream preprocessor script (provided with Oracle SQL Connector for HDFS) modifies the input data to a format that ORACLE_LOADER can process.

See Also:

Oracle Database Administrator's Guide for information about external tables
Oracle Database Utilities for more information about external tables, performance hints, and restrictions when you are using the ORACLE_LOADER access driver.

2.9.1 What Are Location Files?

A location file is a file specified in the location clause of the external table. Oracle SQL Connector for HDFS creates location files that contain only the Universal Resource Identifiers (URIs) of the data files. A data file contains the data stored in HDFS.

2.9.2 Enabling Parallel Processing

To enable parallel processing with external tables, you must specify multiple files in the location clause of the external table. The number of files, also known as the degree of parallelism, determines the number of child processes started by the external table during a table read. Ideally, the degree of parallelism is no larger than the number of data files, to avoid idle child processes.

2.9.3 Location File Management

The Oracle SQL Connector for HDFS command-line tool, ExternalTable, manages the location files of the external table. Location file management involves the following operations:

Generating new location files in the database directory after checking for name conflicts
Deleting existing location files in the database directory as necessary
Publishing data URIs to new location files
Altering the LOCATION clause of the external table to match the new location files

Location file management for the supported data sources is described in the following topics.

Data Pump file format

The ORACLE_LOADER access driver is required to access Data Pump files. The driver requires that each location file corresponds to a single Data Pump file in HDFS. Empty location files are not allowed, and so the number of location files in the external table must exactly match the number of data files in HDFS.

Oracle SQL Connector for HDFS automatically takes over location file management and ensures that the number of location files in the external table equals the number of Data Pump files in HDFS.

Delimited files in HDFS and Hive tables

The ORACLE_LOADER access driver has no limitation on the number of location files. Each location file can correspond to one or more data files in HDFS. The number of location files for the external table is suggested by the oracle.hadoop.exttab.locationFileCount configuration property.

See "Connection Properties".

2.9.4 Location File Names

This is the format of a location file name:

osch-timestamp-number-n

In this syntax:

timestamp has the format yyyyMMddhhmmss, for example, 20121017103941 for October 17, 2012, at 10:39:41.
number is a random number used to prevent location file name conflicts among different tables.
n is an index used to prevent name conflicts between location files for the same table.

For example, osch-20121017103941-6807-1.

2.10 Configuring Oracle SQL Connector for HDFS

You can pass configuration properties to the ExternalTable tool on the command line with the -D option, or you can create a configuration file and pass it on the command line with the -conf option. These options must precede the command to be executed (-createTable, -publish, -listLocations, or -getDDL).

For example, this command uses a configuration file named example.xml:

hadoop jar OSCH_HOME/jlib/orahdfs.jar \
       oracle.hadoop.exttab.ExternalTable \
       -conf /home/oracle/example.xml \
       -createTable

See "ExternalTable Command-Line Tool Syntax".

2.10.1 Creating a Configuration File

A configuration file is an XML document with a very simple structure as follows:

<?xml version="1.0"?>
<configuration>
  <property>
    <name>property</name>
    <value>value</value>
  </property>
     .
     .
     .
</configuration>

Example 2-9 shows a configuration file. See "Oracle SQL Connector for HDFS Configuration Property Reference" for descriptions of these properties.

Example 2-9 Configuration File for Oracle SQL Connector for HDFS

<?xml version="1.0"?>
<configuration>
  <property>
    <name>oracle.hadoop.exttab.tableName</name>
    <value>SH.SALES_EXT_DIR</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.dataPaths</name>
    <value>/data/s1/*.csv,/data/s2/*.csv</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.dataCompressionCodec</name>
    <value>org.apache.hadoop.io.compress.DefaultCodec</value>
  </property>
  <property>
    <name>oracle.hadoop.connection.url</name>
    <value>jdbc:oracle:thin:@//myhost:1521/myservicename</value>
  </property>
  <property>
    <name>oracle.hadoop.connection.user</name>
    <value>SH</value>
  </property>
</configuration>

2.10.2 Oracle SQL Connector for HDFS Configuration Property Reference

The following is a complete list of the configuration properties used by the ExternalTable command-line tool. The properties are organized into these categories:

General Properties
Connection Properties

General Properties

oracle.hadoop.exttab.columnCount

The number of columns for the external table created from delimited text files. The column names are set to C1, C2,... Cn, where n is value of this property.

This property is ignored if oracle.hadoop.exttab.columnNames is set.

The -createTable command uses this property when oracle.hadoop.exttab.sourceType=text.

You must set one of these properties when creating an external table from delimited text files:

oracle.hadoop.exttab.columnNames
oracle.hadoop.exttab.columnCount

oracle.hadoop.exttab.columnNames

A comma-separated list of column names for an external table created from delimited text files. If this property is not set, then the column names are set to C1, C2,... Cn, where n is the value of the oracle.hadoop.exttab.columnCount property.

The column names are read as SQL identifiers: unquoted values are capitalized, and double-quoted values stay exactly as entered.

The -createTable command uses this property when oracle.hadoop.exttab.sourceType=text.

You must set one of these properties when creating an external table from delimited text files:

oracle.hadoop.exttab.columnNames
oracle.hadoop.exttab.columnCount

oracle.hadoop.exttab.dataCompressionCodec

The name of the compression codec class used to decompress the data files. Specify this property when the data files are compressed. Optional.

This property specifies the class name of the compression codec that implements the org.apache.hadoop.io.compress.CompressionCodec interface. This codec applies to all data files.

Several standard codecs are available in Hadoop, including the following:

bzip2: org.apache.hadoop.io.compress.BZip2Codec
gzip: org.apache.hadoop.io.compress.GzipCodec

Default value: None

oracle.hadoop.exttab.dataPaths

A comma-separated list of fully qualified HDFS paths. This parameter enables you to restrict the input by using special pattern-matching characters in the path specification. See Table 2-1. This property is required for the -createTable and -public commands using Data Pump or delimited text files. The property is ignored for Hive data sources.

For example, to select all files in /data/s2/, and only the CSV files in /data/s7/, /data/s8/, and /data/s9/, enter this expression:

/data/s2/,/data/s[7-9]/*.csv

The external table accesses the data contained in all listed files and all files in listed directories. These files compose a single data set.

The data set can contain compressed files or uncompressed files, but not both.

Table 2-1 Pattern-Matching Characters

Character	Description
?	Matches any single character
*	Matches zero or more characters
[`abc`]	Matches a single character from the character set {a, b, c}
[`a-b`]	Matches a single character from the character range {a...b}. The character a must be less than or equal to b.
[^`a`]	Matches a single character that is not from character set or range {a}. The carat (^) must immediately follow the left bracket.
\`c`	Removes any special meaning of character c. The backslash is the escape character.
{`ab`\,`cd`}	Matches a string from the string set {ab, cd}. Precede the comma with an escape character (\) to remove the meaning of the comma as a path separator.
{`ab`\,`c`{`de\`,`fh`}}	Matches a string from the string set {ab, cde, cfh}. Precede the comma with an escape character (\) to remove the meaning of the comma as a path separator.

oracle.hadoop.exttab.dataPathFilter

The path filter class. This property is ignored for Hive data sources.

Oracle SQL Connector for HDFS uses a default filter to exclude hidden files, which begin with a dot or an underscore. If you specify another path filter class using the this property, then your filter acts in addition to the default filter. Thus, only visible files accepted by your filter are considered.

oracle.hadoop.exttab.defaultDirectory

Specifies the default directory for the Oracle external table. This directory is used for all input and output files that do not explicitly name a directory object.

Valid value: The name of an existing database directory

Unquoted names are changed to upper case. Double-quoted names are not changed; use them when case-sensitivity is desired. Single-quoted names are not allowed for default directory names.

The -createTable command requires this property.

oracle.hadoop.exttab.fieldTerminator

Specifies the field terminator for an external table when oracle.hadoop.exttab.sourceType=text. Optional.

Default value: , (comma)

Valid values: A string in one of the following formats:

One or more regular printable characters; it cannot start with \u. For example, \t represents a tab.
One or more encoded characters in the format \uHHHH, where HHHH is a big-endian hexadecimal representation of the character in UTF-16. For example, \u0009 represents a tab. The hexadecimal digits are case insensitive.

Do not mix the two formats.

oracle.hadoop.exttab.hive.databaseName

The name of a Hive database that contains the input data table.

The -createTable command requires this property when oracle.hadoop.exttab.sourceType=hive.

oracle.hadoop.exttab.hive.tableName

The name of an existing Hive table.

The -createTable command requires this property when oracle.hadoop.exttab.sourceType=hive.

oracle.hadoop.exttab.initialFieldEncloser

Specifies the initial field encloser for an external table created from delimited text files. Optional.

Default value: null; no enclosers are specified for the external table definition.

The -createTable command uses this property when oracle.hadoop.exttab.sourceType=text.

Valid values: A string in one of the following formats:

One or more regular printable characters; it cannot start with \u.
One or more encoded characters in the format \uHHHH, where HHHH is a big-endian hexadecimal representation of the character in UTF-16. The hexadecimal digits are case insensitive.

Do not mix the two formats.

oracle.hadoop.exttab.locationFileCount

Specifies the desired number of location files for the external table. Applicable only to non-Data-Pump files.

Default value: 4

This property is ignored if the data files are in Data Pump format. Otherwise, the number of location files is the lesser of:

The number of data files
The value of this property

At least one location file is created.

See "Enabling Parallel Processing" for more information about the number of location files.

oracle.hadoop.exttab.logDirectory

Specifies a database directory where log files, bad files, and discard files are stored. The file names are the default values used by external tables. For example, the name of a log file is the table name followed by _%p.log.

This is an optional property for the -createTable command.

These are the default file name extensions:

Log files: log
Bad files: bad
Discard files: dsc

Valid values: An existing Oracle directory object name.

Unquoted names are uppercased. Quoted names are not changed. Table 2-2 provides examples of how values are transformed.

Table 2-2 Examples of Quoted and Unquoted Values

Specified Value	Interpreted Value
`my_log_dir:'sales_tab_%p.log '`	MY_LOG_DIR/sales_tab_%p.log
`'my_log_dir':'sales_tab_%p.log'`	my_log_dir/sales_tab_%p.log
`"my_log_dir":"sales_tab_%p.log"`	my_log_dir/sales_tab_%p.log

oracle.hadoop.exttab.preprocessorDirectory

Specifies the database directory for the preprocessor. The file-system directory must contain the hdfs_stream script.

Default value: OSCH_BIN_PATH

The preprocessor directory is used in the PREPROCESSOR clause of the external table.

oracle.hadoop.exttab.recordDelimiter

Specifies the record delimiter for an external table created from delimited text files. Optional.

Default value: \n

The -createTable command uses this parameter when oracle.hadoop.exttab.sourceType=text.

Valid values: A string in one of the following formats:

One or more regular printable characters; it cannot start with \u.
One or more encoded characters in the format \uHHHH, where HHHH is a big-endian hexadecimal representation of the character in UTF-16. The hexadecimal digits are case insensitive.

Do not mix the two formats.

oracle.hadoop.exttab.sourceType

Specifies the type of source files.

The valid values are datapump, hive, and text.

Default value: text

The -createTable and -publish operations require the value of this parameter.

oracle.hadoop.exttab.tableName

Schema-qualified name of the external table in this format:

schemaName.tableName

If you omit schemaName, then the schema name defaults to the connection user name.

Default value: none

Required property for all operations.

oracle.hadoop.exttab.trailingFieldEncloser

Specifies the trailing field encloser for an external table created from delimited text files. Optional.

Default value: null; defaults to the value of oracle.hadoop.exttab.initialFieldEncloser

The -createTable command uses this property when oracle.hadoop.exttab.sourceType=text.

Valid values: A string in one of the following formats:

One or more regular printable characters; it cannot start with \u.
One or more encoded characters in the format \uHHHH, where HHHH is a big-endian hexadecimal representation of the character in UTF-16. The hexadecimal digits are case insensitive.

Do not mix the two formats.

Connection Properties

oracle.hadoop.connection.url

Specifies the database connection string in the thin-style service name format:

jdbc:oracle:thin:@//host_name:port/service_name

If you are unsure of the service name, then enter this SQL command as a privileged user:

SQL> show parameter service

If an Oracle wallet is configured as an external password store, then the property value must start with the driver prefix jdbc:oracle:thin:@ and db_connect_string must exactly match the credentials defined in the wallet.

This property takes precedence over all other connection properties.

Default value: Not defined

Valid values: A string

oracle.hadoop.connection.user

An Oracle database log-in name. The externalTable tool prompts for a password. This parameter is required unless you are using Oracle wallet as an external password store.

Default value: Not defined

Valid values: A string

oracle.hadoop.connection.tnsEntryName

Specifies a TNS entry name defined in the tnsnames.ora file.

This property is used with the oracle.hadoop.connection.tns_admin property.

Default value: Not defined

Valid values: A string

oracle.hadoop.connection.tns_admin

Specifies the directory that contains the tnsnames.ora file. Define this property to use transparent network substrate (TNS) entry names in database connection strings. When using TNSNames with the JDBC thin driver, you must set either this property or the Java oracle.net.tns_admin property. When both are set, this property takes precedence over oracle.net.tns_admin.

This property must be set when using Oracle Wallet as an external password store. See oracle.hadoop.connection.wallet_location.

Default value: The value of the Java oracle.net.tns_admin system property

Valid values: A string

oracle.hadoop.connection.wallet_location

A file path to an Oracle wallet directory where the connection credential is stored.

Default value: Not defined

Valid values: A string

When using Oracle Wallet as an external password store, set these properties:

oracle.hadoop.connection.wallet_location
oracle.hadoop.connection.url or oracle.hadoop.connection.tnsEntryName
oracle.hadoop.connection.tns_admin

2.11 Performance Tips for Querying Data in HDFS

Parallel processing is extremely important when you are working with large volumes of data. When you use external tables, always enable parallel query with this SQL command:

ALTER SESSION ENABLE PARALLEL QUERY;

Before loading the data into an Oracle database from the external files created by Oracle SQL Connector for HDFS, enable parallel DDL:

ALTER SESSION ENABLE PARALLEL DDL;

Before inserting data into an existing database table, enable parallel DML with this SQL command:

ALTER SESSION ENABLE PARALLEL DML;

Hints such as APPEND and PQ_DISTRIBUTE also improve performance when you are inserting data.