2 Oracle SQL Connector for Hadoop Distributed File System

This chapter describes how to use Oracle SQL Connector for Hadoop Distributed File System (HDFS) to facilitate data access between Hadoop and Oracle Database.

2.1 About Oracle SQL Connector for HDFS

Using Oracle SQL Connector for HDFS, you can use Oracle Database to access and analyze data residing in Apache Hadoop in these formats:

  • Data Pump files in HDFS

  • Delimited text files in HDFS

  • Delimited text files in Apache Hive tables

For other file formats, such as JSON files, you can stage the input as delimited text in a new Hive table and then use Oracle SQL Connector for HDFS. Partitioned Hive tables are supported, enabling you to represent a subset of Hive table partitions in Oracle Database, instead of the entire Hive table.

Oracle SQL Connector for HDFS uses external tables and database views to provide Oracle Database with read access to Hive tables, and to delimited text files and Data Pump files in HDFS. An external table is an Oracle Database object that identifies the location of data outside of a database. Oracle Database accesses the data by using the metadata provided when the external table was created. Oracle SQL Connector for HDFS creates database views over external tables to support access to partitioned Hive tables. By querying the external tables or views, you can access data stored in HDFS and Hive tables as if that data were stored in tables in an Oracle database.

To create these objects in Oracle Database, you use the ExternalTable command-line tool provided with Oracle SQL Connector for HDFS. You provide ExternalTable with information about the data source in Hadoop and about your schema in Oracle Database, either as options on the ExternalTable command line or in an XML file. The ExternalTable command-line tool can be run either from a shell or from SQL Developer.

When the external table is ready, you can query the data the same as any other database table. You can query and join data in HDFS or a Hive table with other database-resident data.

You can also perform bulk loads of data into Oracle database tables using SQL. You may prefer that the data reside in an Oracle database (either all of it or just a selection) if it is queried routinely. Oracle SQL Connector for HDFS functions as a Hadoop client running on the Oracle Database system and uses the external table preprocessor hdfs_stream to access data in HDFS. Oracle Shell for Hadoop Loaders has commands to create an external table and do a bulk load in one step.
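
For example, once an external table such as MOVIE_FACTS_EXT from Example 2-1 exists, queries like the following sketch are possible. The CUSTOMERS table, its CUST_NAME column, and the MOVIE_FACTS target table are assumptions used only for illustration; adjust the names to your schema.

SQL> -- Join HDFS-resident data with a database-resident table
SQL> SELECT c.cust_name, COUNT(*) AS activity_count
       FROM movie_facts_ext m JOIN customers c ON m.cust_id = c.cust_id
      GROUP BY c.cust_name;

SQL> -- Bulk load routinely queried data into a database table
SQL> INSERT /*+ APPEND */ INTO movie_facts SELECT * FROM movie_facts_ext;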

See Also:

The following Oracle blog provides information on using Oracle SQL Developer: https://blogs.oracle.com/bigdataconnectors/move-data-between-apache-hadoop-and-oracle-database-with-sql-developer

2.2 Getting Started With Oracle SQL Connector for HDFS

The following list identifies the basic steps that you take when using Oracle SQL Connector for HDFS.

  1. Log in to a system where Oracle SQL Connector for HDFS is installed, which can be the Oracle Database system, a node in the Hadoop cluster, or a system set up as a remote client for the Hadoop cluster.
  2. The first time you use Oracle SQL Connector for HDFS, ensure that the software is configured.

    See "Configuring Your System for Oracle SQL Connector for HDFS." You might also need to edit hdfs_stream if your environment is unique. See "Installing Oracle SQL Connector for HDFS".

  3. If you are connecting to a secure cluster, then run kinit to authenticate yourself.
  4. Create an XML document describing the connections and the data source, unless you are providing these properties in the ExternalTable command.
  5. Create a shell script containing an ExternalTable command.
  6. Run the shell script.
  7. If the job fails, then use the diagnostic messages in the output to identify and correct the error. Depending on how far the job progressed before failing, you may need to delete the table definition from the Oracle database before rerunning the script.
  8. After the job succeeds, connect to Oracle Database as the owner of the external table. Query the table to ensure that the data is accessible.
  9. If the data will be queried frequently, then you may want to load it into a database table to improve querying performance. External tables do not have indexes or partitions.

    If you want the data to be compressed as it loads into the table, then create the table with the COMPRESS option, as shown in the sketch after this list.

  10. To delete the Oracle Database objects created by Oracle SQL Connector for HDFS, use the -drop command.
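
The following is a minimal sketch of step 9 with compression enabled. It assumes the MOVIE_FACTS_EXT external table created in Example 2-1; the target table name is illustrative.

SQL> CREATE TABLE movie_facts COMPRESS AS SELECT * FROM movie_facts_ext;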

Example 2-1 Accessing HDFS Data Files from Oracle Database

The following illustrates these steps:

$ cat moviefact_hdfs.sh
# Add environment variables
export OSCH_HOME="/u01/connectors/orahdfs-<version>"
 
hadoop jar $OSCH_HOME/jlib/orahdfs.jar \
       oracle.hadoop.exttab.ExternalTable \
       -conf /home/oracle/movies/moviefact_hdfs.xml \
       -createTable

$ cat moviefact_hdfs.xml
<?xml version="1.0"?>
 <configuration>
    <property>
      <name>oracle.hadoop.exttab.tableName</name>
      <value>MOVIE_FACTS_EXT</value>
    </property>
    <property>
      <name>oracle.hadoop.exttab.locationFileCount</name>
      <value>4</value>
    </property>
    <property>
      <name>oracle.hadoop.exttab.dataPaths</name>
      <value>/user/oracle/moviework/data/part*</value>
    </property>
    <property>
      <name>oracle.hadoop.exttab.fieldTerminator</name>
      <value>\u0009</value>
    </property>
    <property>
      <name>oracle.hadoop.exttab.defaultDirectory</name>
      <value>MOVIEDEMO_DIR</value>
    </property>
    <property>
      <name>oracle.hadoop.exttab.nullIfSpecifier</name>
      <value>\N</value>
    </property>
    <property>
      <name>oracle.hadoop.exttab.columnNames</name>
      <value>CUST_ID,MOVIE_ID,GENRE_ID,TIME_ID,RECOMMENDED,ACTIVITY_ID,RATING,SALES</value>
    </property>
    <property>
      <name>oracle.hadoop.exttab.colMap.TIME_ID.columnType</name>
      <value>TIMESTAMP</value>
    </property>
    <property>
      <name>oracle.hadoop.exttab.colMap.timestampMask</name>
      <value>YYYY-MM-DD:HH:MI:SS</value>
    </property>
    <property>
      <name>oracle.hadoop.exttab.colMap.RECOMMENDED.columnType</name>
      <value>NUMBER</value>
    </property>
    <property>
      <name>oracle.hadoop.exttab.colMap.ACTIVITY_ID.columnType</name>
      <value>NUMBER</value>
    </property>
    <property>
      <name>oracle.hadoop.exttab.colMap.RATING.columnType</name>
      <value>NUMBER</value>
    </property>
    <property>
      <name>oracle.hadoop.exttab.colMap.SALES.columnType</name>
      <value>NUMBER</value>
    </property>
    <property>
      <name>oracle.hadoop.exttab.sourceType</name>
      <value>text</value>
    </property>
    <property>
      <name>oracle.hadoop.connection.url</name>
      <value>jdbc:oracle:thin:@localhost:1521:orcl</value>
    </property>
    <property>
      <name>oracle.hadoop.connection.user</name>
      <value>MOVIEDEMO</value>
    </property>  
</configuration>

$ sh moviefact_hdfs.sh
Oracle SQL Connector for HDFS Release 3.4.0 - Production
 
Copyright (c) 2011, 2015, Oracle and/or its affiliates. All rights reserved.
 
[Enter Database Password: password]
The create table command succeeded.
 
CREATE TABLE "MOVIEDEMO"."MOVIE_FACTS_EXT"
(
 "CUST_ID"                        VARCHAR2(4000),
 "MOVIE_ID"                       VARCHAR2(4000),
 "GENRE_ID"                       VARCHAR2(4000),
 "TIME_ID"                        TIMESTAMP(9),
 "RECOMMENDED"                    NUMBER,
 "ACTIVITY_ID"                    NUMBER,
 "RATING"                         NUMBER,
 "SALES"                          NUMBER
)
ORGANIZATION EXTERNAL
( 
   TYPE ORACLE_LOADER
   DEFAULT DIRECTORY "MOVIEDEMO_DIR"
   ACCESS PARAMETERS
   (
     RECORDS DELIMITED BY 0X'0A'
     CHARACTERSET AL32UTF8
     PREPROCESSOR "OSCH_BIN_PATH":'hdfs_stream'
     FIELDS TERMINATED BY 0X'09'
     MISSING FIELD VALUES ARE NULL
     (
       "CUST_ID" CHAR(4000),
       "MOVIE_ID" CHAR(4000),
       "GENRE_ID" CHAR(4000),
       "TIME_ID" CHAR,
       "RECOMMENDED" CHAR,
       "ACTIVITY_ID" CHAR,
       "RATING" CHAR,
       "SALES" CHAR
     )
   )
   LOCATION
   (
     'osch-20141114064206-5250-1',
     'osch-20141114064206-5250-2',
     'osch-20141114064206-5250-3',
     'osch-20141114064206-5250-4'
   )
) PARALLEL REJECT LIMIT UNLIMITED;
 
The following location files were created.
 
osch-20141114064206-5250-1 contains 1 URI, 12754882 bytes
 
    12754882 hdfs://localhost.localdomain:8020/user/oracle/moviework/data/part-00001
 
osch-20141114064206-5250-2 contains 1 URI, 438 bytes
 
         438 hdfs://localhost.localdomain:8020/user/oracle/moviework/data/part-00002
 
osch-20141114064206-5250-3 contains 1 URI, 432 bytes
 
         432 hdfs://localhost.localdomain:8020/user/oracle/moviework/data/part-00003
 
osch-20141114064206-5250-4 contains 1 URI, 202 bytes
 
         202 hdfs://localhost.localdomain:8020/user/oracle/moviework/data/part-00004

$ sqlplus moviedemo
 
SQL*Plus: Release 12.1.0.1.0 Production on Fri Apr 18 09:24:18 2014
 
Copyright (c) 1982, 2013, Oracle.  All rights reserved.
 
Enter password: password
Last Successful login time: Thu Apr 17 2014 18:42:01 -05:00
 
Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.1.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options
 
SQL> DESCRIBE movie_facts_ext;
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 CUST_ID                                            VARCHAR2(4000)
 MOVIE_ID                                           VARCHAR2(4000)
 GENRE_ID                                           VARCHAR2(4000)
 TIME_ID                                            TIMESTAMP(9)
 RECOMMENDED                                        NUMBER
 ACTIVITY_ID                                        NUMBER
 RATING                                             NUMBER
 SALES                                              NUMBER

SQL> CREATE TABLE movie_facts AS SELECT * FROM movie_facts_ext;
 
Table created.

SQL> SELECT movie_id, time_id, recommended, rating FROM movie_facts WHERE rownum < 5;
 
MOVIE_ID TIME_ID                          RECOMMENDED     RATING
-------- -------------------------------- ----------- ----------
205      03-DEC-10 03.14.54.000000000 AM            1          1
77       14-AUG-11 10.46.55.000000000 AM            1          3
116      24-NOV-11 05.43.00.000000000 AM            1          5
141      01-JAN-11 05.17.57.000000000 AM            1          4

2.3 Configuring Your System for Oracle SQL Connector for HDFS

You can run the ExternalTable command-line tool provided with Oracle SQL Connector for HDFS on either the Oracle Database system or the Hadoop cluster:

  • For Hive sources, log in to either a node in the Hadoop cluster or a system set up as a Hadoop client for the cluster.

  • For text and Data Pump format files, log in to either the Oracle Database system or a node in the Hadoop cluster.

Oracle SQL Connector for HDFS requires additions to the HADOOP_CLASSPATH environment variable on the system where you log in to run the tool. Your system administrator may have set them up for you when creating your account, or may have left that task for you. See "OS-Level Requirements for OSCH Users".

Setting up the environment variables:

  • Verify that HADOOP_CLASSPATH includes the path to the JAR files for Oracle SQL Connector for HDFS:

    path/orahdfs-<version>/jlib/*
    
  • If you are logged in to a Hadoop cluster with Hive data sources, then verify that HADOOP_CLASSPATH also includes the Hive JAR files and conf directory. For example:

    /usr/lib/hive/lib/*
    /etc/hive/conf
    
  • For your convenience, you can create an OSCH_HOME environment variable. The following is the Bash command for setting it on Oracle Big Data Appliance:

    $ export OSCH_HOME="/opt/oracle/orahdfs-<version>"
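
For example, on a Hadoop cluster node with Hive data sources, the combined settings might look like the following Bash sketch. The paths are typical defaults and vary by installation; adjust them to your environment.

$ export OSCH_HOME="/opt/oracle/orahdfs-<version>"
$ export HADOOP_CLASSPATH="$OSCH_HOME/jlib/*:/usr/lib/hive/lib/*:/etc/hive/conf:$HADOOP_CLASSPATH"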
    

2.4 Using Oracle SQL Connector for HDFS with Oracle Big Data Appliance and Oracle Exadata

Oracle SQL Connector for HDFS is a command-line utility that accepts generic command-line arguments supported by the org.apache.hadoop.util.Tool interface. It also provides a preprocessor for Oracle external tables. See the Oracle Big Data Appliance Software User's Guide for instructions on configuring Oracle Exadata Database Machine for use with Oracle Big Data Appliance.

2.5 Using the ExternalTable Command-Line Tool

Oracle SQL Connector for HDFS provides a command-line tool named ExternalTable. This section describes the basic use of this tool. See "Creating External Tables" for the command syntax that is specific to your data source format.

2.5.1 About ExternalTable

The ExternalTable tool uses the values of several properties to do the following tasks:

  • Create an external table

  • Populate the location files

  • Publish location files to an existing external table

  • List the location files

  • Describe an external table

You can specify these property values in an XML document or individually on the command line.

2.5.2 ExternalTable Command-Line Tool Syntax

This is the full syntax of the ExternalTable command-line tool, which is run using the hadoop command:

hadoop jar OSCH_HOME/jlib/orahdfs.jar \
oracle.hadoop.exttab.ExternalTable \
[-conf config_file]... \ 
[-D property=value]... \
-createTable [--noexecute [--output filename.sql]] 
  | -drop [--noexecute]
  | -describe
  | -publish [--noexecute] 
  | -listLocations [--details]
  | -getDDL

You can either create the OSCH_HOME environment variable or replace OSCH_HOME in the command syntax with the full path to the installation directory for Oracle SQL Connector for HDFS. On Oracle Big Data Appliance, this directory is:

/opt/oracle/orahdfs-<version>

For example, you might run the ExternalTable command-line tool with a command like this:

hadoop jar /opt/oracle/orahdfs-<version>/jlib/orahdfs.jar \
oracle.hadoop.exttab.ExternalTable \
     .
     .
     .

Generic Options and User Commands

-conf config_file

Identifies the name of an XML configuration file containing properties needed by the command being executed.

-D property=value

Assigns a value to a specific property.

-createTable [--noexecute [--output filename]]

Creates an external table definition and publishes the data URIs to the location files of the external table. The output report shows the DDL used to create the external table and lists the contents of the location files. Oracle SQL Connector for HDFS also checks the database to ensure that the required database directories exist and that you have the necessary permissions.

For partitioned Hive tables, Oracle SQL Connector for HDFS creates external tables, views, and a metadata table. See Table 2-2.

Specify the metadata table name for partitioned Hive tables, or the external table name for all other data sources.

Use the --noexecute option to see the execution plan of the command. The operation is not executed, but the report includes the details of the execution plan and any errors. The --output option writes the table DDL from the -createTable command to a file. Oracle recommends that you first execute a -createTable command with --noexecute.
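
For example, the following sketch previews the DDL for the configuration file used in Example 2-1 without creating any database objects; the output file name is illustrative.

hadoop jar $OSCH_HOME/jlib/orahdfs.jar \
       oracle.hadoop.exttab.ExternalTable \
       -conf /home/oracle/movies/moviefact_hdfs.xml \
       -createTable --noexecute --output /tmp/moviefact_ext.sql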

-drop [--noexecute]

Deletes one or more Oracle Database objects created by Oracle SQL Connector for HDFS to support a particular data source. Specify the metadata table name for partitioned Hive tables, or the external table name for all other data sources. An error occurs if you attempt to drop a table or view that Oracle SQL Connector for HDFS did not create.

Use the --noexecute option to list the objects to be deleted.

-describe

Provides information about the Oracle Database objects created by Oracle SQL Connector for HDFS. Use this command instead of -getDDL or -listLocations.

-publish [--noexecute]

Publishes the data URIs to the location files of an existing external table. Use this command after adding new data files, so that the existing external table can access them.

Use the --noexecute option to see the execution plan of the command. The operation is not executed, but the report shows the planned SQL ALTER TABLE command and location files. The report also shows any errors.

Oracle recommends that you first execute a -publish command with --noexecute.

See "Updating External Tables."

-listLocations [--details]

Shows the location file content as text. With the --details option, this command provides a detailed listing. This command is deprecated in release 3.0. Use “-describe” instead.

-getDDL

Prints the table definition of an existing external table. This command is deprecated in release 3.0. Use “-describe” instead.

See Also:

"Syntax Conventions"

2.6 Creating External Tables

You can create external tables automatically using the ExternalTable tool provided in Oracle SQL Connector for HDFS.

2.6.1 Creating External Tables with the ExternalTable Tool

To create an external table using the ExternalTable tool, follow the instructions for your data source format in the topics that follow.

When the ExternalTable -createTable command finishes executing, the external table is ready for use. ExternalTable also manages the location files for the external table. See "Location File Management."

To create external tables manually, follow the instructions in "Creating External Tables in SQL."

ExternalTable Syntax for -createTable

Use the following syntax to create an external table and populate its location files:

hadoop jar OSCH_HOME/jlib/orahdfs.jar oracle.hadoop.exttab.ExternalTable \
[-conf config_file]... \
[-D property=value]... \
-createTable [--noexecute]

2.6.2 Creating External Tables from Data Pump Format Files

Oracle SQL Connector for HDFS supports only Data Pump files produced by Oracle Loader for Hadoop, and does not support generic Data Pump files produced by Oracle Utilities.

Oracle SQL Connector for HDFS creates the external table definition for Data Pump files by using the metadata from the Data Pump file header. It uses the ORACLE_LOADER access driver with the preprocessor access parameter. It also uses a special access parameter named EXTERNAL VARIABLE DATA, which enables ORACLE_LOADER to read the Data Pump format files generated by Oracle Loader for Hadoop.
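
The generated DDL therefore has roughly the following shape. This is only an illustrative sketch, not the exact output: the column list comes from the Data Pump file header, and the schema, directory, and location file names shown here are placeholders.

CREATE TABLE "SCOTT"."SALES_DP_XTAB"
(
  "PROD_ID"    NUMBER,        -- columns are derived from the Data Pump file metadata;
  "CUST_ID"    NUMBER,        -- these names are placeholders
  "AMOUNT"     NUMBER(10,2)
)
ORGANIZATION EXTERNAL
(
   TYPE ORACLE_LOADER
   DEFAULT DIRECTORY "SALES_DP_DIR"
   ACCESS PARAMETERS
   (
     EXTERNAL VARIABLE DATA
     PREPROCESSOR "OSCH_BIN_PATH":'hdfs_stream'
   )
   LOCATION ( 'osch-timestamp-number-1' )
) PARALLEL REJECT LIMIT UNLIMITED;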

To delete the external tables and location files created by Oracle SQL Connector for HDFS, use the -drop command. See "Dropping Database Objects Created by Oracle SQL Connector for HDFS".

Note:

Oracle SQL Connector for HDFS requires a patch to Oracle Database 11.2.0.2 before the connector can access Data Pump files produced by Oracle Loader for Hadoop. To download this patch, go to http://support.oracle.com and search for bug 14557588.

Release 11.2.0.3 and later releases do not require this patch.

2.6.2.3 Defining Properties in XML Files for Data Pump Format Files

The following example is an XML template containing the properties that describe a Data Pump file. To use the template, cut and paste it into a text file, enter the appropriate values to describe your Data Pump file, and delete any optional properties that you do not need. For more information about using XML templates, see "Creating a Configuration File."

Example 2-2 XML Template with Properties for a Data Pump Format File

<?xml version="1.0"?>

<!-- Required Properties -->

<configuration>
  <property>
    <name>oracle.hadoop.exttab.tableName</name>
    <value>value</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.defaultDirectory</name>
    <value>value</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.dataPaths</name>
    <value>value</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.sourceType</name>
    <value>datapump</value>
  </property>
  <property>
    <name>oracle.hadoop.connection.url</name>
    <value>value</value>
  </property>
  <property>
    <name>oracle.hadoop.connection.user</name>
    <value>value</value>
  </property>

<!-- Optional Properties -->

  <property>
    <name>oracle.hadoop.exttab.logDirectory</name>
    <value>value</value>
  </property>
</configuration>

2.6.2.4 Example

The following example creates an external table named SALES_DP_XTAB to read Data Pump files.

Example 2-3 Defining an External Table for Data Pump Format Files

Log in as the operating system user that Oracle Database runs under (typically the oracle user), and create a file-system directory. For Oracle RAC, you must create a clusterwide directory on a distributed file system.

$ mkdir /data/sales_dp_dir

Create a database directory and grant read and write access to it:

$ sqlplus / as sysdba
SQL> CREATE OR REPLACE DIRECTORY sales_dp_dir AS '/data/sales_dp_dir';
SQL> GRANT READ, WRITE ON DIRECTORY sales_dp_dir TO scott;

Create the external table:

$ export OSCH_HOME="/opt/oracle/orahdfs-<version>"
$ export HADOOP_CLASSPATH="$OSCH_HOME/jlib/*:$HADOOP_CLASSPATH"
$ hadoop jar $OSCH_HOME/jlib/orahdfs.jar \
oracle.hadoop.exttab.ExternalTable \
-D oracle.hadoop.exttab.tableName=SALES_DP_XTAB \
-D oracle.hadoop.exttab.sourceType=datapump \
-D oracle.hadoop.exttab.dataPaths=hdfs:///user/scott/olh_sales_dpoutput/ \
-D oracle.hadoop.exttab.defaultDirectory=SALES_DP_DIR \
-D oracle.hadoop.connection.url=jdbc:oracle:thin:@//myhost:1521/myservicename \
-D oracle.hadoop.connection.user=SCOTT \
-createTable

2.6.3 Creating External Tables from Hive Tables

Oracle SQL Connector for HDFS creates the external table definition from a Hive table by contacting the Hive metastore client to retrieve information about the table columns and the location of the table data. In addition, the Hive table data paths are published to the location files of the Oracle external table.

To read Hive table metadata, Oracle SQL Connector for HDFS requires that the Hive JAR files be included in the HADOOP_CLASSPATH variable. Oracle SQL Connector for HDFS must be installed and running on a computer with a working Hive client.

Ensure that you also add the Hive configuration directory to the HADOOP_CLASSPATH environment variable.

For Hive managed tables, the data paths come from the warehouse directory.

For Hive external tables, the data paths from an external location in HDFS are published to the location files of the Oracle external table. A Hive external table might have no data, because Hive does not check whether the external location is defined when the table is created. If the Hive table is empty, then one location file is published with just a header and no data URIs.

The Oracle external table is not a "live" Hive table. After changes are made to a Hive table, you must use the ExternalTable tool to drop the existing external table and create a new one.
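
For example, after a change to the Hive table definition, refreshing the Oracle-side objects might look like the following sketch; the configuration file name is illustrative.

hadoop jar $OSCH_HOME/jlib/orahdfs.jar oracle.hadoop.exttab.ExternalTable \
       -conf /home/oracle/sales_hive.xml -drop
hadoop jar $OSCH_HOME/jlib/orahdfs.jar oracle.hadoop.exttab.ExternalTable \
       -conf /home/oracle/sales_hive.xml -createTable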

To delete the external tables and location files created by Oracle SQL Connector for HDFS, use the -drop command. See "Dropping Database Objects Created by Oracle SQL Connector for HDFS".

2.6.3.1 Hive Table Requirements

Oracle SQL Connector for HDFS supports Hive tables that are defined using ROW FORMAT DELIMITED and FILE FORMAT TEXTFILE clauses. Both Hive-managed tables and Hive external tables are supported.

Oracle SQL Connector for HDFS also supports partitioned Hive tables. In this case Oracle SQL Connector for HDFS creates one or more external tables and database views. See "Creating External Tables from Partitioned Hive Tables".

Hive tables can be either bucketed or not bucketed. All primitive types from Hive 0.10.0 are supported.
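
For example, a Hive table defined like the following sketch satisfies these requirements. The table and column names are illustrative.

CREATE TABLE sales_country_us (
  cust_id  STRING,
  prod_id  STRING,
  amount   DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;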

2.6.3.2 Data Type Mappings

The following table shows the default data-type mappings between Hive and Oracle. To change the data type of the target columns created in the Oracle external table, set the oracle.hadoop.exttab.hive.columnType.* properties listed under "Optional Properties."

Table 2-1 Hive Data Type Mappings

Data Type of Source Hive Column    Default Data Type of Target Oracle Column
---------------------------------  ---------------------------------------------------
INT, BIGINT, SMALLINT, TINYINT     INTEGER
DECIMAL                            NUMBER
DECIMAL(p,s)                       NUMBER(p,s)
DOUBLE, FLOAT                      NUMBER
DATE                               DATE with format mask YYYY-MM-DD
TIMESTAMP                          TIMESTAMP with format mask YYYY-MM-DD HH24:MI:SS.FF
BOOLEAN                            VARCHAR2(5)
CHAR(size)                         CHAR(size)
STRING                             VARCHAR2(4000)
VARCHAR                            VARCHAR2(4000)
VARCHAR(size)                      VARCHAR2(size)

2.6.3.5 Defining Properties in XML Files for Hive Tables

The following example is an XML template containing the properties that describe a Hive table. To use the template, cut and paste it into a text file, enter the appropriate values to describe your Hive table, and delete any optional properties that you do not need. For more information about using XML templates, see "Creating a Configuration File."

Example 2-4 XML Template with Properties for a Hive Table

<?xml version="1.0"?>

<!-- Required Properties -->

<configuration>
  <property>
    <name>oracle.hadoop.exttab.tableName</name>
    <value>value</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.defaultDirectory</name>
    <value>value</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.sourceType</name>
    <value>hive</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.hive.partitionFilter</name>
    <value>value</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.hive.tableName</name>
    <value>value</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.hive.databaseName</name>
    <value>value</value>
  </property>
  <property>
    <name>oracle.hadoop.connection.url</name>
    <value>value</value>
  </property>
  <property>
    <name>oracle.hadoop.connection.user</name>
    <value>value</value>
  </property>

<!-- Optional Properties -->

  <property>
    <name>oracle.hadoop.exttab.locationFileCount</name>
    <value>value</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.hive.columnType.TYPE</name>
    <value>value</value>
  </property>
</configuration>

2.6.3.6 Example

This example creates an external table named SALES_HIVE_XTAB to read data from a Hive table. The example defines all the properties on the command line instead of in an XML file.

Example 2-5 Defining an External Table for a nonpartitioned Hive Table

Log in as the operating system user that Oracle Database runs under (typically the oracle user), and create a file-system directory:

$ mkdir /data/sales_hive_dir

Create a database directory and grant read and write access to it:

$ sqlplus / as sysdba
SQL> CREATE OR REPLACE DIRECTORY sales_hive_dir AS '/data/sales_hive_dir';
SQL> GRANT READ, WRITE ON DIRECTORY sales_hive_dir TO scott;

Create the external table:

$ export OSCH_HOME="/opt/oracle/orahdfs-<version>"
$ export HADOOP_CLASSPATH="$OSCH_HOME/jlib/*:/usr/lib/hive/lib/*:/etc/hive/conf:$HADOOP_CLASSPATH"

$ hadoop jar $OSCH_HOME/jlib/orahdfs.jar \
oracle.hadoop.exttab.ExternalTable \
-D oracle.hadoop.exttab.tableName=SALES_HIVE_XTAB \
-D oracle.hadoop.exttab.sourceType=hive \
-D oracle.hadoop.exttab.locationFileCount=2 \
-D oracle.hadoop.exttab.hive.tableName=sales_country_us \
-D oracle.hadoop.exttab.hive.databaseName=salesdb \
-D oracle.hadoop.exttab.defaultDirectory=SALES_HIVE_DIR \
-D oracle.hadoop.connection.url=jdbc:oracle:thin:@//myhost:1521/myservicename \
-D oracle.hadoop.connection.user=SCOTT \
-createTable

Note:

For nonpartitioned Hive tables and other data sources, the value of the oracle.hadoop.exttab.tableName property is the name of the external table.

2.6.3.7 Creating External Tables from Partitioned Hive Tables

Oracle SQL Connector for HDFS supports partitioned Hive tables, enabling you to query a single partition, a range of partitions, or all partitions. You can represent all Hive partitions or a subset of them in Oracle Database.

See Also:

"Creating External Tables from Hive Tables" for required properties, data type mappings, and other details applicable to all Hive table access using Oracle SQL Connector for HDFS.

2.6.3.7.1 Database Objects that Support Access to Partitioned Hive Tables

To support a partitioned Hive table, Oracle SQL Connector for HDFS creates the objects described in the following table.

Table 2-2 Oracle Database Objects for Supporting a Partitioned Hive Table

External Tables
  Description: One for each Hive partition
  Naming Convention (1): OSCHtable_name_n, for example, OSCHDAILY_1 and OSCHDAILY_2

Views
  Description: One for each external table. Used for querying the Hive data.
  Naming Convention (1): table_name_n, for example, DAILY_1 and DAILY_2

Metadata Table
  Description: One for the Hive table. Identifies all external tables and views associated with a particular Hive table. Specify this table when creating, describing, or dropping these database objects.
  Naming Convention (1): table_name, for example, DAILY

Footnote 1: The _n appended to the table name is a numeric index.

For example, if a Hive table comprises five partitions, then Oracle SQL Connector for HDFS creates five external tables, five views, and one metadata table in Oracle Database.

To drop the objects described in Table 2-2 and the location files, use the -drop command. See "Dropping Database Objects Created by Oracle SQL Connector for HDFS".

Note:

For partitioned Hive tables, the value of the oracle.hadoop.exttab.tableName property is the name of the metadata table.

2.6.3.7.2 Querying the Metadata Table

The metadata table provides critical information about how to query the Hive table. The following table describes the columns of a metadata table.

Table 2-3 Metadata Table Columns

VIEW_NAME
  The Oracle Database view used to access a single Hive table partition. The view contains both Hive table and partition columns.

EXT_TABLE_NAME
  An Oracle external table that represents a Hive table partition. The external table contains only the Hive table columns and not the Hive partition columns. To access all the data in a Hive partition, use the corresponding Oracle Database view.

HIVE_TABLE_NAME
  The partitioned Hive table being accessed through Oracle Database.

HIVE_DB_NAME
  The Hive database where the table resides.

HIVE_PART_FILTER
  The Hive partition filter used to select a subset of partitions for access by Oracle Database. A NULL value indicates that all partitions are accessed.

Partition Columns
  Each column used to partition the Hive table has a separate column in the metadata table. For example, the metadata table has columns for COUNTRY, STATE, and CITY for a Hive table partitioned by a combination of COUNTRY, STATE, and CITY values.

The following SELECT statement queries a metadata table named HIVE_SALES_DATA:

SQL> SELECT view_name, ext_table_name, hive_table_name,
            hive_db_name, country, city
       FROM hive_sales_data
      WHERE state = 'TEXAS';

The results of the query identify three views with data from cities in Texas:

VIEW_NAME          EXT_TABLE_NAME         HIVE_TABLE_NAME   HIVE_DB_NAME  COUNTRY  CITY 
------------------------------------------------------------------------------------------
HIVE_SALES_DATA_1  OSCHHIVE_SALES_DATA_1  hive_sales_data   db_sales      US       AUSTIN
HIVE_SALES_DATA_2  OSCHHIVE_SALES_DATA_2  hive_sales_data   db_sales      US       HOUSTON
HIVE_SALES_DATA_3  OSCHHIVE_SALES_DATA_3  hive_sales_data   db_sales      US       DALLAS

The views include partition column values. Oracle recommends that you use the views while querying a partitioned Hive table, as the external tables do not include the partition column values.
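
For example, continuing the HIVE_SALES_DATA example above, a query against one of the views can reference both data columns and partition columns. The CUST_ID and AMOUNT column names are illustrative.

SQL> SELECT cust_id, amount, country, state, city
       FROM hive_sales_data_1
      WHERE city = 'AUSTIN';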

2.6.3.7.3 Creating UNION ALL Views for Querying

To facilitate querying, you can create UNION ALL views over the individual partition views. Use the mkhive_unionall_view.sql script, which is provided in the OSCH_HOME/example/sql directory. To maintain performance, do not create UNION ALL views over more than 50 to 100 views (depending on their size).

To use mkhive_unionall_view.sql, use the following syntax:

@mkhive_unionall_view[.sql] table schema view predicate

MKHIVE_UNIONALL_VIEW Script Parameters

table

The name of the metadata table in Oracle Database that represents a partitioned Hive table. Required.

schema

The owner of the metadata table. Optional; defaults to your schema.

view

The name of the UNION ALL view created by the script. Optional; defaults to table_ua.

predicate

A WHERE condition used to select the partitions in the Hive table to include in the UNION ALL view. Optional; defaults to all partitions.

Example 2-6 Union All Views for Partitioned Hive Tables

The following example creates a UNION ALL view named HIVE_SALES_DATA_UA, which accesses all partitions listed in the HIVE_SALES_DATA metadata table:

SQL> @mkhive_unionall_view.sql HIVE_SALES_DATA null null null
 

This example creates a UNION ALL view named ALL_SALES, which accesses all partitions listed in the HIVE_SALES_DATA metadata table:

SQL> @mkhive_unionall_view.sql HIVE_SALES_DATA null ALL_SALES null
 

The next example creates a UNION ALL view named TEXAS_SALES_DATA, which accesses the rows of all partitions where STATE = 'TEXAS'.

SQL> @mkhive_unionall_view.sql HIVE_SALES_DATA null TEXAS_SALES_DATA '(STATE = ''''TEXAS'''')'

2.6.3.7.4 Error Messages

table name too long, max limit length

Cause: The names generated for the database objects exceed 30 characters.

Action: Specify a name that does not exceed 24 characters in the oracle.hadoop.exttab.tableName property. Oracle SQL Connector for HDFS generates external table names using the convention OSCHtable_name_n. See Table 2-2.

table/view names containing string table_name found in schema schema_name

Cause: An attempt was made to create external tables for a partitioned Hive table, but the data objects already exist.

Action: Use the ExternalTable -drop command to drop the existing tables and views, and then retry the -createTable command. If this solution fails, then you might have "dangling" objects. See "Dropping Dangling Objects".

2.6.3.7.5 Dropping Dangling Objects

Always use Oracle SQL Connector for HDFS commands to manage objects created by the connector to support partitioned Hive tables. Dangling objects result when you drop a metadata table with the SQL DROP TABLE command instead of using the -drop command. If you cannot drop the external tables and views for a partitioned Hive table with the -drop command, then they are dangling objects.

Notice the schema and table names in the error message generated when you attempted to drop the objects, and use them in the following procedure.

To drop dangling database objects:

  1. Open a SQL session with Oracle Database, and connect as the owner of the dangling objects.

  2. Identify the location files of the external table by querying the ALL_EXTERNAL_LOCATIONS and ALL_EXTERNAL_TABLES data dictionary views:

    SELECT a.table_name, a.directory_name, a.location
      FROM all_external_locations a, all_external_tables b
     WHERE a.table_name = b.table_name
       AND a.table_name LIKE 'OSCHtable%' AND a.owner = 'schema';
     
    

    In the LIKE clause of the previous syntax, replace table and schema with the appropriate values.

    In the output, the location file names have an osch- prefix, such as osch-20140408014604-175-1.

  3. Identify the external tables by querying the ALL_EXTERNAL_TABLES data dictionary view:

    SELECT table_name FROM all_external_tables
     WHERE table_name LIKE 'OSCHtable%'
       AND owner = 'schema';
     
    
  4. Identify the database views by querying the ALL_VIEWS data dictionary view:

    SELECT view_name FROM all_views 
    WHERE view_name 
    LIKE 'table%' AND owner='schema';
     
    
  5. Inspect the tables, views, and location files to verify that they are not needed, using commands like the following:

    DESCRIBE schema.table;
    SELECT * FROM schema.table;
    
    DESCRIBE schema.view;
    SELECT * FROM schema.view;
    
  6. Delete the location files, tables, and views that are not needed, using commands like the following:

    EXECUTE utl_file.fremove('directory', 'location_file');
    
    DROP TABLE schema.table;
    DROP VIEW schema.view;
    

2.6.4 Creating External Tables from Delimited Text Files

Oracle SQL Connector for HDFS creates the external table definition for delimited text files using configuration properties that specify the number of columns, the text delimiter, and optionally, the external table column names. By default, all text columns in the external table are VARCHAR2. If column names are not provided, they default to C1 to Cn, where n is the number of columns specified by the oracle.hadoop.exttab.columnCount property.

2.6.4.1 Data Type Mappings

All text data sources are automatically mapped to VARCHAR2(4000). To change the data type of the target columns created in the Oracle external table, set the oracle.hadoop.exttab.colMap.* properties listed under "Optional Properties."

2.6.4.4 Defining Properties in XML Files for Delimited Text Files

This example is an XML template containing all the properties that describe a delimited text file. To use the template, cut and paste it into a text file, enter the appropriate values to describe your data files, and delete any optional properties that you do not need. For more information about using XML templates, see "Creating a Configuration File."

Example 2-7 XML Template with Properties for a Delimited Text File

<?xml version="1.0"?>

<!-- Required Properties -->

<configuration>
  <property>
    <name>oracle.hadoop.exttab.tableName</name>
    <value>value</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.defaultDirectory</name>
    <value>value</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.dataPaths</name>
    <value>value</value>
  </property>

<!-- Use either columnCount or columnNames -->

  <property>
    <name>oracle.hadoop.exttab.columnCount</name>
    <value>value</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.columnNames</name>
    <value>value</value>
  </property>

  <property>
    <name>oracle.hadoop.connection.url</name>
    <value>value</value>
  </property>
  <property>
    <name>oracle.hadoop.connection.user</name>
    <value>value</value>
  </property>

<!-- Optional Properties -->

  <property>
    <name>oracle.hadoop.exttab.colMap.TYPE</name>
    <value>value</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.recordDelimiter</name>
    <value>value</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.fieldTerminator</name>
    <value>value</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.initialFieldEncloser</name>
    <value>value</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.trailingFieldEncloser</name>
    <value>value</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.locationFileCount</name>
    <value>value</value>
  </property>
</configuration>

2.6.4.5 Example

This example creates an external table named SALES_DT_XTAB from delimited text files.

Example 2-8 Defining an External Table for Delimited Text Files

Log in as the operating system user that Oracle Database runs under (typically the oracle user), and create a file-system directory:

$ mkdir /data/sales_dt_dir

Create a database directory and grant read and write access to it:

$ sqlplus / as sysdba
SQL> CREATE OR REPLACE DIRECTORY sales_dt_dir AS '/data/sales_dt_dir';
SQL> GRANT READ, WRITE ON DIRECTORY sales_dt_dir TO scott;

Create the external table:

$ export OSCH_HOME="/opt/oracle/orahdfs-<version>"
$ export HADOOP_CLASSPATH="$OSCH_HOME/jlib/*:$HADOOP_CLASSPATH"

$ hadoop jar $OSCH_HOME/jlib/orahdfs.jar \
oracle.hadoop.exttab.ExternalTable \
-D oracle.hadoop.exttab.tableName=SALES_DT_XTAB \
-D oracle.hadoop.exttab.locationFileCount=2 \
-D oracle.hadoop.exttab.dataPaths="hdfs:///user/scott/olh_sales/*.dat" \
-D oracle.hadoop.exttab.columnCount=10 \
-D oracle.hadoop.exttab.defaultDirectory=SALES_DT_DIR \
-D oracle.hadoop.connection.url=jdbc:oracle:thin:@//myhost:1521/myservicename \
-D oracle.hadoop.connection.user=SCOTT \
-createTable

2.6.5 Creating External Tables in SQL

You can create an external table manually for Oracle SQL Connector for HDFS. For example, the following procedure enables you to use external table syntax that is not exposed by the ExternalTable -createTable command.

Additional syntax might not be supported for Data Pump format files.

To create an external table manually:

  1. Use the -createTable --noexecute command to generate the external table DDL.

  2. Make whatever changes are needed to the DDL.

  3. Run the DDL from the previous step to create the table definition in the Oracle database.

  4. Use the ExternalTable -publish command to publish the data URIs to the location files of the external table.
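
The following Bash sketch illustrates this manual flow. The configuration file, output file, and schema names are placeholders.

# Step 1: generate the external table DDL without executing it
hadoop jar $OSCH_HOME/jlib/orahdfs.jar oracle.hadoop.exttab.ExternalTable \
       -conf /home/oracle/my_table.xml \
       -createTable --noexecute --output /tmp/my_table.sql

# Steps 2 and 3: edit /tmp/my_table.sql as needed, then run it in the database
sqlplus scott @/tmp/my_table.sql

# Step 4: publish the HDFS data URIs to the location files
hadoop jar $OSCH_HOME/jlib/orahdfs.jar oracle.hadoop.exttab.ExternalTable \
       -conf /home/oracle/my_table.xml \
       -publish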

2.7 Updating External Tables

The -publish option provides a way to update the location files (that point to HDFS data paths) of existing Oracle external tables based on text, Data Pump, or Hive sources.

The ExternalTable command-line tool with the -createTable option creates the external table and related metadata in Oracle Database. It also populates the external table’s location files with the Uniform Resource Identifiers (URIs) of the data files in HDFS.

To “publish” updates to the location files of existing external tables, use ExternalTable with the -publish option. This operation updates the location files of external tables with new URIs of HDFS data paths, and adds external tables and views for new partitions when the source is a partitioned Hive table.

Use -publish in order to:

  • Publish new data into an already existing external table.

    -createTable takes a snapshot of the HDFS or Hive source at the time of external table creation. However, the source may change later. The -publish option enables you to update the existing external table from the source.

  • Publish new data when the source is a partitioned Hive table.

    When new partitions are added to a source that is a partitioned Hive table, the -publish option enables you to add the database objects required to access the new partitions. This option detects the new Hive partitions, creates the additional external tables and views for those partitions, and updates the metadata table created with the -createTable command.

  • Populate an external table that you created manually using the -createTable command with the --noexecute option.

Note:

The -publish option now fully supports partitioned Hive tables. It is no longer necessary to use -drop and -createTable as a workaround to update external tables derived from partitioned Hive tables.

2.7.1 ExternalTable Syntax for Publish

hadoop jar OSCH_HOME/jlib/orahdfs.jar \
oracle.hadoop.exttab.ExternalTable \
[-conf config_file]... \
[-D property=value]... \
-publish [--noexecute]

2.7.2 ExternalTable Example for Publish

This example sets HADOOP_CLASSPATH and publishes the HDFS data paths to the external table created in Example 2-3. See "Configuring Your System for Oracle SQL Connector for HDFS" for more information about setting this environment variable.

Example 2-9 Publishing HDFS Data Paths to an External Table for Data Pump Format Files

This example uses the Bash shell.

$ export HADOOP_CLASSPATH="OSCH_HOME/jlib/*"
$ hadoop jar OSCH_HOME/jlib/orahdfs.jar oracle.hadoop.exttab.ExternalTable \
-D oracle.hadoop.exttab.tableName=SALES_DP_XTAB \
-D oracle.hadoop.exttab.sourceType=datapump \
-D oracle.hadoop.exttab.dataPaths=hdfs:/user/scott/data/ \
-D oracle.hadoop.connection.url=jdbc:oracle:thin:@//myhost:1521/myservicename \
-D oracle.hadoop.connection.user=scott -publish

In this example:

  • OSCH_HOME is the full path to the Oracle SQL Connector for HDFS installation directory.

  • SALES_DP_XTAB is the external table created in Example 2-3.

  • hdfs:/user/scott/data/ is the location of the HDFS data.

  • jdbc:oracle:thin:@//myhost:1521/myservicename is the database connection string.

2.8 Exploring External Tables and Location Files

The -describe command is a debugging and diagnostic utility that prints the definition of an existing external table. It also enables you to see the location file metadata and contents. You can use this command to verify the integrity of the location files of an Oracle external table.

These properties are required to use this command: oracle.hadoop.exttab.tableName and the JDBC connection properties (see "Connections using url, user, and password Properties").

2.8.1 ExternalTable Syntax for Describe

hadoop jar OSCH_HOME/jlib/orahdfs.jar \
oracle.hadoop.exttab.ExternalTable \
[-conf config_file]... \
[-D property=value]... \
-describe

2.8.2 ExternalTable Example for Describe

This example shows the command syntax to describe the external tables and location files associated with SALES_DP_XTAB.

Example 2-10 Exploring External Tables and Location Files

$ export HADOOP_CLASSPATH="OSCH_HOME/jlib/*"
$ hadoop jar OSCH_HOME/jlib/orahdfs.jar oracle.hadoop.exttab.ExternalTable \
-D oracle.hadoop.exttab.tableName=SALES_DP_XTAB \
-D oracle.hadoop.connection.url=jdbc:oracle:thin:@//myhost:1521/myservicename \
-D oracle.hadoop.connection.user=scott -describe

2.9 Dropping Database Objects Created by Oracle SQL Connector for HDFS

The -drop command deletes the database objects created by Oracle SQL Connector for HDFS. These objects include external tables, location files, and views. If you delete objects manually, problems can arise as described in "Dropping Dangling Objects".

The -drop command only deletes objects created by Oracle SQL Connector for HDFS. Oracle recommends that you always use the -drop command to drop objects created by Oracle SQL Connector for HDFS.

These properties are required to use this command: oracle.hadoop.exttab.tableName and the JDBC connection properties (see "Connections using url, user, and password Properties").

2.9.1 ExternalTable Syntax for Drop

hadoop jar OSCH_HOME/jlib/orahdfs.jar \
oracle.hadoop.exttab.ExternalTable \
[-conf config_file]... \
[-D property=value]... \
-drop

2.9.2 ExternalTable Example for Drop

This example shows the command syntax to drop the database objects associated with SALES_DP_XTAB.

Example 2-11 Dropping Database Objects

$ export HADOOP_CLASSPATH="OSCH_HOME/jlib/*"
$ hadoop jar OSCH_HOME/jlib/orahdfs.jar oracle.hadoop.exttab.ExternalTable \
-D oracle.hadoop.exttab.tableName=SALES_DP_XTAB \
-D oracle.hadoop.connection.url=jdbc:oracle:thin:@//myhost:1521/myservicename \
-D oracle.hadoop.connection.user=scott -drop

2.10 More About External Tables Generated by the ExternalTable Tool

Because external tables are used to access data, all of the features and limitations of external tables apply. Queries are executed in parallel with automatic load balancing. However, update, insert, and delete operations are not allowed and indexes cannot be created on external tables. When an external table is accessed, a full table scan is always performed.

Oracle SQL Connector for HDFS uses the ORACLE_LOADER access driver. The hdfs_stream preprocessor script (provided with Oracle SQL Connector for HDFS) modifies the input data to a format that ORACLE_LOADER can process.

2.10.1 About Configurable Column Mappings

Oracle SQL Connector for HDFS uses default data type mappings to create columns in an Oracle external table with the appropriate data types for the Hive and text sources. You can override these defaults by setting various configuration properties, for either all columns or a specific column.

For example, a field in a text file might contain a timestamp. By default, the field is mapped to a VARCHAR2 column. However, you can specify a TIMESTAMP column and provide a datetime mask to cast the values correctly into the TIMESTAMP data type. The TIMESTAMP data type supports time-based queries and analysis that are unavailable when the data is presented as text.

2.10.1.1 Default Column Mappings

Text sources are mapped to VARCHAR2 columns, and Hive columns are mapped to columns with the closest equivalent Oracle data type. Table 2-1 shows the default mappings.

2.10.1.2 All Column Overrides

The following properties apply to all columns in the external table. For Hive sources, these property settings override the oracle.hadoop.exttab.hive.* property settings.

2.10.1.4 Mapping Override Examples

The following properties create an external table in which all columns are the default VARCHAR2 data type:

oracle.hadoop.exttab.tableName=MOVIE_FACT_EXT_TAB_TXT
oracle.hadoop.exttab.columnNames=CUST_ID,MOVIE_ID,GENRE_ID,TIME_ID,RECOMMENDED,ACTIVITY_ID,RATING,SALES

In this example, the following properties are set to override the data type of several columns:

oracle.hadoop.exttab.colMap.TIME_ID.columnType=TIMESTAMP
oracle.hadoop.exttab.colMap.RECOMMENDED.columnType=NUMBER
oracle.hadoop.exttab.colMap.ACTIVITY_ID.columnType=NUMBER
oracle.hadoop.exttab.colMap.RATING.columnType=NUMBER
oracle.hadoop.exttab.colMap.SALES.columnType=NUMBER

Oracle SQL Connector for HDFS creates an external table with the specified data types:

SQL> DESCRIBE movie_facts_ext
Name                                      Null?    Type
----------------------------------------- -------- ----------------------------
CUST_ID                                            VARCHAR2(4000)
MOVIE_ID                                           VARCHAR2(4000)
GENRE_ID                                           VARCHAR2(4000)
TIME_ID                                            TIMESTAMP(9)
RECOMMENDED                                        NUMBER
ACTIVITY_ID                                        NUMBER
RATING                                             NUMBER
SALES                                              NUMBER

The next example adds the following property settings to change the length of the VARCHAR2 columns:

oracle.hadoop.exttab.colMap.CUST_ID.columnLength=12
oracle.hadoop.exttab.colMap.MOVIE_ID.columnLength=12
oracle.hadoop.exttab.colMap.GENRE_ID.columnLength=12

All columns now have custom data types:

SQL> DESCRIBE movie_facts_ext
Name                                      Null?    Type
----------------------------------------- -------- ----------------------------
CUST_ID                                            VARCHAR2(12)
MOVIE_ID                                           VARCHAR2(12)
GENRE_ID                                           VARCHAR2(12)
TIME_ID                                            TIMESTAMP(9)
RECOMMENDED                                        NUMBER
ACTIVITY_ID                                        NUMBER
RATING                                             NUMBER
SALES                                              NUMBER

2.10.2 What Are Location Files?

A location file is a file specified in the location clause of the external table. Oracle SQL Connector for HDFS creates location files that contain only the Uniform Resource Identifiers (URIs) of the data files. A data file contains the data stored in HDFS.

2.10.3 Enabling Parallel Processing

To enable parallel processing with external tables, you must specify multiple files in the location clause of the external table. The number of files determines the number of child processes started by the external table during a table read, which is known as the degree of parallelism or DOP.

2.10.3.1 Setting Up the Degree of Parallelism

Ideally, you can decide to run at a particular degree of parallelism and create a number of location files that is a multiple of the degree of parallelism, as described in the following procedure.

To set up parallel processing for maximum performance:

  1. Identify the maximum DOP that your Oracle DBA will permit you to use when running Oracle SQL Connector for HDFS.

    When loading a huge amount of data into an Oracle database, you should also work with the DBA to identify a time when the maximum resources are available.

  2. Create a number of location files that is a small multiple of the DOP. For example, if the DOP is 8, then you might create 8, 16, 24, or 32 location files.

  3. Create a number of HDFS files that are about the same size and a multiple of the number of location files. For example, if you have 32 location files, then you might create 128, 1280, or more HDFS files, depending on the amount of data and the minimum HDFS file size.

  4. Set the DOP for the data load, using either the ALTER SESSION command or hints in the SQL SELECT statement.

    This example sets the DOP to 8 using ALTER SESSION:

    ALTER SESSION FORCE PARALLEL DML PARALLEL 8;
    ALTER SESSION FORCE PARALLEL QUERY PARALLEL 8;
    

    The next example sets the DOP to 8 using the PARALLEL hint:

    INSERT /*+ parallel(my_db_table,8) */ INTO my_db_table
       SELECT /*+ parallel(my_hdfs_external_table,8) */ *
       FROM my_hdfs_external_table;
    

    An APPEND hint in the SQL INSERT statement can also help improve performance.
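
    For example, combining the APPEND hint with the PARALLEL hints from the previous statement might look like the following sketch:

    INSERT /*+ APPEND parallel(my_db_table,8) */ INTO my_db_table
       SELECT /*+ parallel(my_hdfs_external_table,8) */ *
       FROM my_hdfs_external_table;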

2.10.4 Location File Management

The Oracle SQL Connector for HDFS command-line tool, ExternalTable, creates an external table and publishes the HDFS URI information to location files. The external table location files are stored in the directory specified by the oracle.hadoop.exttab.defaultDirectory property. For an Oracle RAC database, this directory must reside on a distributed file system that is accessible to each database server.

ExternalTable manages the location files of the external table, which involves the following operations:

  • Generating new location files in the database directory after checking for name conflicts

  • Deleting existing location files in the database directory as necessary

  • Publishing data URIs to new location files

  • Altering the LOCATION clause of the external table to match the new location files

Location file management for the supported data sources is described in the following topics.

Data Pump File Format

The ORACLE_LOADER access driver is required to access Data Pump files. The driver requires that each location file corresponds to a single Data Pump file in HDFS. Empty location files are not allowed, and so the number of location files in the external table must exactly match the number of data files in HDFS.

Oracle SQL Connector for HDFS automatically takes over location file management and ensures that the number of location files in the external table equals the number of Data Pump files in HDFS.

Delimited Files in HDFS and Hive Tables

The ORACLE_LOADER access driver has no limitation on the number of location files. Each location file can correspond to one or more data files in HDFS. The number of location files for the external table is suggested by the oracle.hadoop.exttab.locationFileCount configuration property.

See "Connections using url, user, and password Properties".

2.10.5 Location File Names

This is the format of a location file name:

osch-timestamp-number-n

In this syntax:

  • timestamp has the format yyyyMMddhhmmss, for example, 20121017103941 for October 17, 2012, at 10:39:41.

  • number is a random number used to prevent location file name conflicts among different tables.

  • n is an index used to prevent name conflicts between location files for the same table.

For example, osch-20121017103941-6807-1.

2.11 Configuring Oracle SQL Connector for HDFS

You can pass configuration properties to the ExternalTable tool on the command line with the -D option, or you can create a configuration file and pass it on the command line with the -conf option. These options must precede the command to be executed.

For example, this command uses a configuration file named example.xml:

hadoop jar OSCH_HOME/jlib/orahdfs.jar \
       oracle.hadoop.exttab.ExternalTable \
       -conf /home/oracle/example.xml \
       -createTable

See "ExternalTable Command-Line Tool Syntax".

2.11.1 Creating a Configuration File

A configuration file is an XML document with a very simple structure as follows:

<?xml version="1.0"?>
<configuration>
  <property>
    <name>property</name>
    <value>value</value>
  </property>
     .
     .
     .
</configuration>

The following example shows a configuration file. See "Oracle SQL Connector for HDFS Configuration Property Reference" for descriptions of these properties.

Example 2-12 Configuration File for Oracle SQL Connector for HDFS

<?xml version="1.0"?>
<configuration>
  <property>
    <name>oracle.hadoop.exttab.tableName</name>
    <value>SH.SALES_EXT_DIR</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.dataPaths</name>
    <value>/data/s1/*.csv,/data/s2/*.csv</value>
  </property>
  <property>
    <name>oracle.hadoop.connection.url</name>
    <value>jdbc:oracle:thin:@//myhost:1521/myservicename</value>
  </property>
  <property>
    <name>oracle.hadoop.connection.user</name>
    <value>SH</value>
  </property>
</configuration>

2.11.2 Oracle SQL Connector for HDFS Configuration Property Reference

The following is a complete list of the configuration properties used by the ExternalTable command-line tool. The properties are organized into these categories:

General Properties

Property Description

oracle.hadoop.exttab.colMap.columnLength

Specifies the length of all external table columns of type CHAR, VARCHAR2, NCHAR, NVARCHAR2, and RAW. Optional.

Default Value: The maximum length allowed by the column type

For Oracle Database 12c, Oracle SQL Connector for HDFS sets the length of VARCHAR2, NVARCHAR2, and RAW columns depending on whether the database MAX_STRING_SIZE option is set to STANDARD or EXTENDED.

Valid values: Integer

oracle.hadoop.exttab.colMap.columnType

Specifies the data type mapping of all columns for Hive and text sources. Optional.

You can override this setting for specific columns by setting oracle.hadoop.exttab.colMap.column_name.columnType.

Default value: VARCHAR2 for text; see Table 2-1 for Hive

Valid values: The following Oracle data types are supported:

  • VARCHAR2
  • NVARCHAR2
  • CHAR
  • NCHAR
  • CLOB
  • NCLOB
  • NUMBER
  • INTEGER
  • FLOAT
  • BINARY_DOUBLE
  • BINARY_FLOAT
  • RAW*
  • DATE
  • TIMESTAMP
  • TIMESTAMP WITH TIME ZONE
  • TIMESTAMP WITH LOCAL TIME ZONE
  • INTERVAL DAY TO SECOND
  • INTERVAL YEAR TO MONTH
* RAW binary data in delimited text files must be encoded in hexadecimal.

oracle.hadoop.exttab.colMap.dateMask

Specifies the format mask used in the date_format_spec clause of the external table for all DATE columns. This clause indicates that a character data field contains a date in the specified format.

Default value: The default globalization format mask, which is set by the NLS_DATE_FORMAT database parameter

Valid values: A datetime format model as described in Oracle Database SQL Language Reference. However, it cannot contain quotation marks.

oracle.hadoop.exttab.colMap.fieldLength

Sets the character buffer length used by the ORACLE_LOADER access driver for all CLOB columns. The value is used in the field_list clause of the external table definition, which identifies the fields in the data file and their data types.

Default value: 4000 bytes

Valid values: Integer

oracle.hadoop.exttab.colMap.timestampMask

Specifies the format mask used in the date_format_spec clause of the external table for all TIMESTAMP and TIMESTAMP WITH LOCAL TIME ZONE columns. This clause indicates that a character data field contains a timestamp in the specified format.

Default value: The default globalization format mask, which is set by the NLS_TIMESTAMP_FORMAT database parameter

Valid values: A datetime format model as described in Oracle Database SQL Language Reference. However, it cannot contain quotation marks.

oracle.hadoop.exttab.colMap.timestampTZMask

Specifies the format mask used in the date_format_spec clause of the external table for all TIMESTAMP WITH TIME ZONE columns. This clause indicates that a character data field contains a timestamp in the specified format.

Default value: The default globalization format mask, which is set by the NLS_TIMESTAMP_TZ_FORMAT database parameter

Valid values: A datetime format model as described in the Oracle Database SQL Language Reference. However, it cannot contain quotation marks.

oracle.hadoop.exttab.colMap.column_name.columnLength

Overrides the length of the external table column identified by column_name, for columns of type CHAR, VARCHAR2, NCHAR, NVARCHAR2, and RAW. Optional.

Default Value: The value of oracle.hadoop.exttab.colMap.columnLength; if that property is not set, then the maximum length allowed by the data type

Valid values: Integer

oracle.hadoop.exttab.colMap.column_name.columnType

Overrides the data type mapping for column_name. Optional.

The column_name is case-sensitive. It must exactly match the name of a column in a Hive table or a column listed in oracle.hadoop.exttab.columnNames.

Default value: The value of oracle.hadoop.exttab.colMap.columnType; if that property is not set, then the default data type identified in Table 2-1

Valid values: See oracle.hadoop.exttab.colMap.columnType
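
For example, a configuration-file entry that overrides the type of a single column might look like the following; the column name ORDER_DATE is hypothetical.

<property>
  <name>oracle.hadoop.exttab.colMap.ORDER_DATE.columnType</name>
  <value>TIMESTAMP</value>
</property>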

oracle.hadoop.exttab.colMap.column_name.dateMask

Overrides the format mask for column_name. Optional.

The column_name is case-sensitive. It must exactly match the name of a column in a Hive table or a column listed in oracle.hadoop.exttab.columnNames.

Default value: The value of oracle.hadoop.exttab.colMap.dateMask.

Valid values: A datetime format model as described in the Oracle Database SQL Language Reference. However, it cannot contain quotation marks.

oracle.hadoop.exttab.colMap.column_name.fieldLength

Overrides the character buffer length used by the ORACLE_LOADER access driver for column_name. This property is especially useful for CLOB and extended data type columns. Optional.

The column_name is case-sensitive. It must exactly match the name of a column in a Hive table or a column listed in oracle.hadoop.exttab.columnNames.

Default value: Oracle SQL Connector for HDFS sets the default field lengths as shown in the following table.

Table 2-4 Field Length Calculations

Data Type of Target Column           Field Length

VARCHAR2, NVARCHAR2, CHAR, NCHAR     Value of oracle.hadoop.exttab.colMap.column_name.columnLength
RAW                                  2 * columnLength property
CLOB, NCLOB                          Value of oracle.hadoop.exttab.colMap.fieldLength
All other types                      255 (default size for external tables)

Valid values: Integer

oracle.hadoop.exttab.colMap.column_name.nullIfSpecifier

This property is applied to a column identified by column_name in an external table. Optional. Overrides the property oracle.hadoop.exttab.nullIfSpecifier.

Type: string

Valid values: same as for the property oracle.hadoop.exttab.nullIfSpecifier.

Default values: none.

This property applies only to Delimited Text sources.

oracle.hadoop.exttab.colMap.column_name.timestampMask

Overrides the format mask for column_name. Optional.

The column_name is case-sensitive. It must exactly match the name of a column in a Hive table or a column listed in oracle.hadoop.exttab.columnNames.

Default value: The value of oracle.hadoop.exttab.colMap.timestampMask.

Valid values: A datetime format model as described in the Oracle Database SQL Language Reference. However, it cannot contain quotation marks.

oracle.hadoop.exttab.colMap.column_name.timestampTZMask

Overrides the format mask for column_name. Optional.

The column_name is case-sensitive. It must exactly match the name of a column in a Hive table or a column listed in oracle.hadoop.exttab.columnNames.

Default value: The value of oracle.hadoop.exttab.colMap.timestampTZMask.

Valid values: A datetime format model as described in Oracle Database SQL Language Reference. However, it cannot contain quotation marks.

oracle.hadoop.exttab.columnCount

Specifies the number of columns for the external table created from delimited text files. The column names are set to C1, C2,... Cn, where n is the value of this property.

This property is ignored if oracle.hadoop.exttab.columnNames is set.

The -createTable command uses this property when oracle.hadoop.exttab.sourceType=text.

You must set either this property or oracle.hadoop.exttab.columnNames when creating an external table from delimited text files.

oracle.hadoop.exttab.columnNames

Specifies a comma-separated list of column names for an external table created from delimited text files. If this property is not set, then the column names are set to C1, C2,... Cn, where n is the value of the oracle.hadoop.exttab.columnCount property.

The column names are read as SQL identifiers: unquoted names are converted to uppercase, and double-quoted names retain exactly the case entered.

The -createTable command uses this property when oracle.hadoop.exttab.sourceType=text.

You must set either this property or oracle.hadoop.exttab.columnCount when creating an external table from delimited text files.
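
For illustration, a configuration entry naming three columns might look like the following; the column names are hypothetical.

<property>
  <name>oracle.hadoop.exttab.columnNames</name>
  <value>PROD_ID,CUST_ID,AMOUNT_SOLD</value>
</property>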

oracle.hadoop.exttab.dataCompressionCodec

Notice:

This property is deprecated.

OSCH discovers the compression codec of each file at runtime. The data set can contain both compressed and uncompressed files, as well as files compressed with different codecs.

Specifies the name of the compression codec class used to decompress the data files. Specify this property when the data files are compressed. Optional.

This property specifies the class name of any compression codec that implements the org.apache.hadoop.io.compress.CompressionCodec interface. This codec applies to all data files.

Several standard codecs are available in Hadoop, including the following:

  • bzip2: org.apache.hadoop.io.compress.BZip2Codec

  • gzip: org.apache.hadoop.io.compress.GzipCodec

Default value: None

oracle.hadoop.exttab.dataPaths

Specifies a comma-separated list of fully qualified HDFS paths. This property enables you to restrict the input by using special pattern-matching characters in the path specification. See the following table. This property is required for the -createTable and -publish commands using Data Pump or delimited text files. The property is ignored for Hive data sources.

For example, to select all files in /data/s2/, and only the CSV files in /data/s7/, /data/s8/, and /data/s9/, enter this expression:

/data/s2/,/data/s[7-9]/*.csv

The external table accesses the data contained in all listed files and all files in listed directories. These files compose a single data set.

Because the compression codec of each file is discovered at runtime, the data set can contain a mix of compressed and uncompressed files (see oracle.hadoop.exttab.dataCompressionCodec).

Table 2-5 Pattern-Matching Characters

Character          Description

?                  Matches any single character
*                  Matches zero or more characters
[abc]              Matches a single character from the character set {a, b, c}
[a-b]              Matches a single character from the character range {a...b}. The character a must be less than or equal to b.
[^a]               Matches a single character that is not from character set or range {a}. The caret (^) must immediately follow the left bracket.
\c                 Removes any special meaning of character c. The backslash is the escape character.
{ab\,cd}           Matches a string from the string set {ab, cd}. Precede the comma with an escape character (\) to remove the meaning of the comma as a path separator.
{ab\,c{de\,fh}}    Matches a string from the string set {ab, cde, cfh}. Precede the comma with an escape character (\) to remove the meaning of the comma as a path separator.

oracle.hadoop.exttab.dataPathFilter

Specifies the path filter class. This property is ignored for Hive data sources.

Oracle SQL Connector for HDFS uses a default filter to exclude hidden files, which begin with a dot or an underscore. If you specify another path filter class using this property, then your filter acts in addition to the default filter. Thus, only visible files accepted by your filter are considered.

oracle.hadoop.exttab.defaultDirectory

Specifies the default directory for the Oracle external table. This directory is used for all input and output files that do not explicitly name a directory object. In Oracle RAC, this directory must reside on shared storage accessible by all Oracle instances.

Valid value: The name of an existing database directory

Unquoted names are changed to upper case. Double-quoted names are not changed; use them when case-sensitivity is desired. Single-quoted names are not allowed for default directory names.

The -createTable command requires this property.

oracle.hadoop.exttab.fieldTerminator

Specifies the field terminator for an external table when oracle.hadoop.exttab.sourceType=text. Optional.

Default value: , (comma)

Valid values: A string in one of the following formats:

  • One or more regular printable characters; it cannot start with \u. For example, a comma (,).

  • One or more encoded characters in the format \uHHHH, where HHHH is a big-endian hexadecimal representation of the character in UTF-16. For example, \u0009 represents a tab. The hexadecimal digits are case insensitive.

Do not mix the two formats.
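
For example, to use a tab as the field terminator, you could set the encoded form shown above (\u0009) in the configuration file:

<property>
  <name>oracle.hadoop.exttab.fieldTerminator</name>
  <value>\u0009</value>
</property>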

oracle.hadoop.exttab.hive.columnType.*

Maps a Hive data type to an Oracle data type. The property name identifies the Hive data type, and its value is an Oracle data type. The target columns in the external table are created with the Oracle data type indicated by this property.

When a Hive TIMESTAMP column is mapped to an Oracle TIMESTAMP column, the format mask is YYYY-MM-DD HH24:MI:SS.FF. When a Hive STRING column is mapped to an Oracle TIMESTAMP column, the NLS parameter settings for the database are used by default. You can override these defaults by using either the oracle.hadoop.exttab.colMap.timestampMask or oracle.hadoop.exttab.colMap.timestampTZMask property.

Default values: The following table lists the Hive column type properties and their default values.

Valid values: See the valid values for oracle.hadoop.exttab.colMap.columnType.

Table 2-6 Hive Column Type Mapping Properties

Property                                          Default Value

oracle.hadoop.exttab.hive.columnType.BIGINT       INTEGER
oracle.hadoop.exttab.hive.columnType.BOOLEAN      VARCHAR2
oracle.hadoop.exttab.hive.columnType.DECIMAL      NUMBER
oracle.hadoop.exttab.hive.columnType.DOUBLE       NUMBER
oracle.hadoop.exttab.hive.columnType.FLOAT        NUMBER
oracle.hadoop.exttab.hive.columnType.INT          INTEGER
oracle.hadoop.exttab.hive.columnType.SMALLINT     INTEGER
oracle.hadoop.exttab.hive.columnType.STRING       VARCHAR2
oracle.hadoop.exttab.hive.columnType.TIMESTAMP    TIMESTAMP
oracle.hadoop.exttab.hive.columnType.TINYINT      INTEGER
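
As an illustration, to map Hive STRING columns to Oracle CLOB instead of the default VARCHAR2, you could add an entry such as the following to your configuration file:

<property>
  <name>oracle.hadoop.exttab.hive.columnType.STRING</name>
  <value>CLOB</value>
</property>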

oracle.hadoop.exttab.hive.databaseName

Specifies the name of a Hive database that contains the input data table.

The -createTable command requires this property when oracle.hadoop.exttab.sourceType=hive.

oracle.hadoop.exttab.hive.partitionFilter

Specifies a valid HiveQL expression that is used to filter the source Hive table partitions. This property is ignored if the table is not partitioned.

Type: String

Default value: None. All partitions of the Hive table are mapped to external tables.

Valid values: A valid HiveQL expression.

Description: Specifies a valid HiveQL expression that is used to filter the source Hive table partitions. This property is ignored if the Hive table is not partitioned. The expression should refer only to the table's partition columns; including other columns does not raise an error, but unintended consequences can result. Oracle recommends that you exclude other columns.

The expression must conform to the following restrictions:

  • Selects partitions and not individual records inside the partitions.

  • Does not include columns that are not used to partition the table, because they might cause unintended consequences.

  • Does not include subqueries.

  • Does not include user-defined functions (UDFs). Built-in functions are supported.

  • Does not support Hive variable name spaces (such as env:, system:, hiveconf:, and hivevar:) because Hive variable expansion is disabled when OSCH processes this string. Expand all variables in Hive CLI before setting this property. For example:

    CREATE VIEW view_name AS SELECT * from database.table_name WHERE expression;
    DESCRIBE FORMATTED view_name;
    

    The View Original Text field contains the query with all variables expanded. Copy the WHERE clause, starting after the WHERE keyword.

    Because all variable expansions are resolved at the Hadoop level, define any Hadoop variables used in the expression with the generic options (-D and -conf). Use the Hive CLI to test the expression and ensure that it returns the expected results. The following examples assume a source table defined with this command:

CREATE TABLE t(c string)
   PARTITIONED BY (p1 string, p2 int, p3 boolean, p4 string, p5 timestamp);

Example 1: Nested Expressions

p1 like 'abc%' or (p5 >= '2010-06-20' and p5 <= '2010-07-03')

Example 2: Built-in Functions

year(p5) = 2014

Example 3: Bad Usage: Columns That Are Not Used to Partition the Table

These examples show that using c, a column that is not used to partition the table, is unnecessary and can cause unexpected results.

This example is equivalent to p2 > 35:

p2 > 35 and c like 'abc%'

This example selects all partitions. Any partition could contain rows where c like 'abc%', so no partitions are filtered out:

p2 > 35 or c like 'abc%'
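
A configuration-file sketch that applies Example 2 above to a Hive source might look like the following; the Hive database name is illustrative, and the table t is the one defined in the CREATE TABLE statement above.

<property>
  <name>oracle.hadoop.exttab.sourceType</name>
  <value>hive</value>
</property>
<property>
  <name>oracle.hadoop.exttab.hive.databaseName</name>
  <value>default</value>
</property>
<property>
  <name>oracle.hadoop.exttab.hive.tableName</name>
  <value>t</value>
</property>
<property>
  <name>oracle.hadoop.exttab.hive.partitionFilter</name>
  <value>year(p5) = 2014</value>
</property>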

oracle.hadoop.exttab.hive.refreshTables

This property applies only when the source is a partitioned Hive table; it is ignored otherwise. It specifies whether the -publish operation should refresh HDFS data paths in existing external tables when adding new external tables and views for new Hive partitions in the source.

Set this property to TRUE to enable refresh. Note that enabling such a refresh can slow down the -publish operation. If the existing Hive partitions in the source table have not changed, set the property to FALSE.

Default value: FALSE

oracle.hadoop.exttab.hive.tableName

Specifies the name of an existing Hive table.

The -createTable command requires this property when oracle.hadoop.exttab.sourceType=hive.

oracle.hadoop.exttab.hive.deleteObsoleteTables

Specifies whether the -publish operation should drop the views and external tables that do not map to any partition in the partitioned Hive table. This property applies only when the source is a partitioned Hive table and is otherwise ignored. It is also ignored if the original -createTable operation for the partitioned Hive source table included the oracle.hadoop.exttab.hive.partitionFilter property.

Set this property to TRUE to enable dropping obsolete objects.

Default value: FALSE

oracle.hadoop.exttab.initialFieldEncloser

Specifies the initial field encloser for an external table created from delimited text files. Optional.

Default value: null; no enclosers are specified for the external table definition.

The -createTable command uses this property when oracle.hadoop.exttab.sourceType=text.

Valid values: A string in one of the following formats:

  • One or more regular printable characters; it cannot start with \u.

  • One or more encoded characters in the format \uHHHH, where HHHH is a big-endian hexadecimal representation of the character in UTF-16. The hexadecimal digits are case insensitive.

Do not mix the two formats.

oracle.hadoop.exttab.locationFileCount

Specifies the desired number of location files for the external table. Applicable only to non-Data-Pump files.

Default value: 4

This property is ignored if the data files are in Data Pump format. Otherwise, the number of location files is the lesser of:

  • The number of data files

  • The value of this property

At least one location file is created.

See "Enabling Parallel Processing" for more information about the number of location files.

oracle.hadoop.exttab.logDirectory

Specifies a database directory where log files and bad files are stored. The file names are the default values used by external tables. For example, the name of a log file is the table name followed by _%p.log.

This is an optional property for the -createTable command.

These are the default file name extensions:

  • Log files: log

  • Bad files: bad

Valid values: An existing Oracle directory object name.

Unquoted names are changed to uppercase. Quoted names are not changed. The following table provides examples of how values are transformed.

Table 2-7 Examples of Quoted and Unquoted Values

Specified Value                    Interpreted Value

my_log_dir:'sales_tab_%p.log'      MY_LOG_DIR/sales_tab_%p.log
'my_log_dir':'sales_tab_%p.log'    my_log_dir/sales_tab_%p.log
"my_log_dir":"sales_tab_%p.log"    my_log_dir/sales_tab_%p.log

oracle.hadoop.exttab.nullIfSpecifier

Specifies the NULLIF clause of the external table definition. Optional.

This property is applied to all the columns in an external table.

Type: string

Valid values: A string in one of the following formats:

  • One or more regular, printable characters, for example: \N

  • One or more encoded characters in the format \uHHHH, where HHHH is a big-endian hexadecimal representation of the character in UTF-16. For example: \u000A represents a newline. The hexadecimal digits are case insensitive.

Default values: none.

This property applies only to Delimited Text sources.

See Also:

Example 2-1. This example shows how to use nullIfSpecifier when accessing HDFS data files from Oracle Database.
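
For instance, to treat the string \N (shown as an example value above) as NULL for every column, a configuration entry could look like this:

<property>
  <name>oracle.hadoop.exttab.nullIfSpecifier</name>
  <value>\N</value>
</property>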

oracle.hadoop.exttab.preprocessorDirectory

Specifies the database directory for the preprocessor. The file-system directory must contain the hdfs_stream script.

Default value: OSCH_BIN_PATH

The preprocessor directory is used in the PREPROCESSOR clause of the external table.

oracle.hadoop.exttab.preprocessorScript

Specifies the name of the preprocessor script for the external table.

Default value: hdfs_stream

The preprocessor script name is used in the PREPROCESSOR clause of the external table. This property is required only for Oracle Database running on Microsoft Windows platforms and is optional for all other Oracle Database platforms. On Microsoft Windows, the value must be set to hdfs_stream.cmd.

oracle.hadoop.exttab.recordDelimiter

Specifies the record delimiter for an external table created from delimited text files. Optional.

Default value: \n

The -createTable command uses this property when oracle.hadoop.exttab.sourceType=text.

Valid values: A string in one of the following formats:

  • One or more regular printable characters; it cannot start with \u.

  • One or more encoded characters in the format \uHHHH, where HHHH is a big-endian hexadecimal representation of the character in UTF-16. The hexadecimal digits are case insensitive.

Do not mix the two formats.

oracle.hadoop.exttab.sourceType

Specifies the type of source files. The -createTable and -publish operations require the value of this property.

Default value: text

Valid values: datapump, hive, or text

oracle.hadoop.exttab.stringSizes

Indicates whether the lengths specified for character strings are bytes or characters. This value is used in the STRING SIZES ARE IN clause of the external table. Use CHARACTERS when loading multibyte character sets. See Oracle Database Utilities.

Default value: BYTES

Valid values: BYTES or CHARACTERS

oracle.hadoop.exttab.createLogFiles

Specifies whether the log files should be created when the external tables are queried. Oracle recommends enabling log file creation during development and disabling log file creation during production for best performance.

Default value: TRUE

Log files are created by default. To stop creating log files you must drop the table, set this property to FALSE, and then recreate the table. Use the -drop and -createTable commands to drop and recreate the table.
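
For example, a sketch of that drop-and-recreate sequence, assuming the configuration file /home/oracle/example.xml from the earlier example:

hadoop jar OSCH_HOME/jlib/orahdfs.jar \
       oracle.hadoop.exttab.ExternalTable \
       -conf /home/oracle/example.xml \
       -drop

hadoop jar OSCH_HOME/jlib/orahdfs.jar \
       oracle.hadoop.exttab.ExternalTable \
       -conf /home/oracle/example.xml \
       -D oracle.hadoop.exttab.createLogFiles=FALSE \
       -createTable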

oracle.hadoop.exttab.printVerbose

Specifies whether to print verbose reports on the console during a -publish operation. Set the value to TRUE to see verbose reports on the console. This property should only be used for debugging.

Default value: FALSE

oracle.hadoop.exttab.createBadFiles

Specifies whether bad files should be created when the external tables are queried. Bad files contain information on rows with bad data. Bad files are created only when there is bad data. Oracle recommends creating bad files.

Default value: TRUE

Bad files are created by default. To stop creating bad files you must drop the table, set this property to FALSE, and then recreate the table. Use the -drop and -createTable commands to drop and recreate the table.

This property applies only to Hive and Delimited Text sources.

oracle.hadoop.exttab.tableName

Specifies the metadata table for partitioned Hive tables, or the schema-qualified name of the external table for all other data sources, in this format:

schemaName.tableName

If you omit schemaName, then the schema name defaults to the connection user name.

Default value: none

Required property for all operations.

oracle.hadoop.exttab.trailingFieldEncloser

Specifies the trailing field encloser for an external table created from delimited text files. Optional.

Default value: null; defaults to the value of oracle.hadoop.exttab.initialFieldEncloser

The -createTable command uses this property when oracle.hadoop.exttab.sourceType=text.

Valid values: A string in one of the following formats:

  • One or more regular printable characters; it cannot start with \u.

  • One or more encoded characters in the format \uHHHH, where HHHH is a big-endian hexadecimal representation of the character in UTF-16. The hexadecimal digits are case insensitive.

Do not mix the two formats.
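
For example, if the text fields are enclosed in double quotation marks, you might set both enclosers to the encoded form of the double-quote character; \u0022 is assumed here as the UTF-16 code for that character.

<property>
  <name>oracle.hadoop.exttab.initialFieldEncloser</name>
  <value>\u0022</value>
</property>
<property>
  <name>oracle.hadoop.exttab.trailingFieldEncloser</name>
  <value>\u0022</value>
</property>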

Connections using url, user, and password Properties

The url, user, and password properties provide a distinct connection method. Do not mix these properties with those required for a connection using an Oracle Wallet.

Property Description

oracle.hadoop.connection.url

Specifies the database connection string in the thin-style service name format:

jdbc:oracle:thin:@//host_name:port/service_name

If you are unsure of the service name, then enter this SQL command as a privileged user:

SQL> show parameter service

This property takes precedence over all other connection properties.

Default value: Not defined

Valid values: A string

oracle.hadoop.connection.user

Specifies an Oracle database log-in name. The ExternalTable tool prompts for a password if oracle.hadoop.connection.password is not specified.

Default value: Not defined

Valid values: A string

oracle.hadoop.connection.password

Specifies the password for the Oracle Database user. Oracle recommends that you do not store a clear-text password in this property except in non-sensitive test or demo environments. To force a password prompt, omit this property from the connection; the ExternalTable tool then prompts for the password. If you require a connection with no prompt, use the Oracle Wallet connection method described in the next section instead.

Default value: Not defined.

Valid values: A string

Connections Using Oracle Wallet

When using Oracle Wallet as an external password store, set the properties shown in the following table.

oracle.hadoop.connection.tnsEntryName

Specifies a TNS entry name defined in the tnsnames.ora file.

This property is used with the oracle.hadoop.connection.tns_admin property.

Default value: Not defined

Valid values: A string

oracle.hadoop.connection.tns_admin

Specifies the directory that contains the tnsnames.ora file. Define this property to use transparent network substrate (TNS) entry names in database connection strings. When using TNSNames with the JDBC thin driver, you must set either this property or the Java oracle.net.tns_admin property. When both are set, this property takes precedence over oracle.net.tns_admin.

This property must be set when using Oracle Wallet as an external password store.

Default value: The value of the Java oracle.net.tns_admin system property

Valid values: A string

oracle.hadoop.connection.wallet_location

Specifies a file path to an Oracle Wallet directory where the connection credential is stored.

Default value: Not defined

Valid values: A string
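
A sketch of a wallet-based connection configuration is shown below; the TNS entry name and directory paths are illustrative and must match your own tnsnames.ora entry and wallet location.

<property>
  <name>oracle.hadoop.connection.tnsEntryName</name>
  <value>orcl_wallet_entry</value>
</property>
<property>
  <name>oracle.hadoop.connection.tns_admin</name>
  <value>/home/oracle/network/admin</value>
</property>
<property>
  <name>oracle.hadoop.connection.wallet_location</name>
  <value>/home/oracle/wallets/osch</value>
</property>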

Tip:

Connections using Oracle Wallet can accommodate many TNS entries and are therefore recommended over connections using the user, password, and url properties, which are restricted to a single machine, port, and service name combination.

For a simple step-by-step demonstration, see the posting Using Oracle SQL Connector for HDFS with Oracle Wallet in the Connecting Hadoop With Oracle blog.

2.12 Performance Tips for Querying Data in HDFS

Parallel processing is extremely important when you are working with large volumes of data. When you use external tables, always enable parallel query with this SQL command:

ALTER SESSION ENABLE PARALLEL QUERY;

Before loading the data into an Oracle database from the external files created by Oracle SQL Connector for HDFS, enable parallel DDL:

ALTER SESSION ENABLE PARALLEL DDL;

Before inserting data into an existing database table, enable parallel DML with this SQL command:

ALTER SESSION ENABLE PARALLEL DML;

Hints such as APPEND and PQ_DISTRIBUTE also improve performance when you are inserting data.
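
For instance, a bulk load from an external table created by Oracle SQL Connector for HDFS into an existing database table might look like the following sketch; the table names are hypothetical, and the PQ_DISTRIBUTE hint is omitted for brevity.

ALTER SESSION ENABLE PARALLEL DML;

-- sales is an existing database table and sales_hdfs_ext_tab is the external
-- table created by Oracle SQL Connector for HDFS; both names are illustrative.
INSERT /*+ APPEND */ INTO sales
  SELECT * FROM sales_hdfs_ext_tab;

COMMIT;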

See My Oracle Support Document 2111850.1 for additional details and examples of improving performance.