2 Oracle SQL Connector for Hadoop Distributed File System

This chapter describes how to use Oracle SQL Connector for Hadoop Distributed File System (HDFS) to facilitate data access between Hadoop and Oracle Database.

2.1 About Oracle SQL Connector for HDFS

Using Oracle SQL Connector for HDFS, you can use Oracle Database to access and analyze data residing in Apache Hadoop in these formats:

  • Data Pump files in HDFS

  • Delimited text files in HDFS

  • Delimited text files in Apache Hive tables

For other file formats, such as JSON files, you can stage the input in Hive tables before using Oracle SQL Connector for HDFS. Partitioned Hive tables are supported, enabling you to represent a subset of Hive table partitions in Oracle Database, instead of the entire Hive table.

Oracle SQL Connector for HDFS uses external tables and database views to provide Oracle Database with read access to Hive tables, and to delimited text files and Data Pump files in HDFS. An external table is an Oracle Database object that identifies the location of data outside of a database. Oracle Database accesses the data by using the metadata provided when the external table was created. Oracle SQL Connector for HDFS creates database views over external tables to support access to partitioned Hive tables. By querying the external tables or views, you can access data stored in HDFS and Hive tables as if that data were stored in tables in an Oracle database.

To create these objects in Oracle Database, you use the ExternalTable command-line tool provided with Oracle SQL Connector for HDFS. You provide ExternalTable with information about the data source in Hadoop and about your schema in an Oracle Database. You provide this information either as options to the ExternalTable command or in an XML file.

When the external table is ready, you can query the data the same as any other database table. You can query and join data in HDFS or a Hive table with other database-resident data.

You can also perform bulk loads of data into Oracle database tables using SQL. If the data is queried routinely, you may prefer to load all of it, or just a selection, into an Oracle database table.
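
For example, assuming a hypothetical external table SALES_HDFS_EXT created by the connector and an existing database table CUSTOMERS (the table and column names are illustrative only), a join might look like this:

-- sales_hdfs_ext and customers are illustrative names
SELECT c.cust_name, SUM(s.sales) AS total_sales
FROM   sales_hdfs_ext s, customers c
WHERE  s.cust_id = c.cust_id
GROUP  BY c.cust_name;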

2.2 Getting Started With Oracle SQL Connector for HDFS

The following list identifies the basic steps that you take when using Oracle SQL Connector for HDFS.

  1. Log in to a system where Oracle SQL Connector for HDFS is installed, which can be the Oracle Database system, a node in the Hadoop cluster, or a system set up as a remote client for the Hadoop cluster.

    See "Installing and Configuring a Hadoop Client on the Oracle Database System."

  2. The first time you use Oracle SQL Connector for HDFS, ensure that the software is configured.

    See "Configuring Your System for Oracle SQL Connector for HDFS." You might also need to edit hdfs_stream if your environment is unique. See "Installing Oracle SQL Connector for HDFS".

  3. If you are connecting to a secure cluster, then run kinit to authenticate yourself.

    See "Using Oracle SQL Connector for HDFS on a Secure Hadoop Cluster."

  4. Create an XML document describing the connections and the data source, unless you are providing these properties in the ExternalTable command.

    See "Exploring External Tables and Location Files."

  5. Create a shell script containing an ExternalTable command.

    See "Using the ExternalTable Command-Line Tool."

  6. Run the shell script.

  7. If the job fails, then use the diagnostic messages in the output to identify and correct the error. Depending on how far the job progressed before failing, you may need to delete the table definition from the Oracle database before rerunning the script.

  8. After the job succeeds, connect to Oracle Database as the owner of the external table. Query the table to ensure that the data is accessible.

  9. If the data will be queried frequently, then you may want to load it into a database table to improve querying performance. External tables do not have indexes or partitions.

    If you want the data to be compressed as it loads into the table, then create the table with the COMPRESS option (a sketch follows this list).

  10. To delete the Oracle Database objects created by Oracle SQL Connector for HDFS, use the -drop command.

    See "Dropping Database Objects Created by Oracle SQL Connector for HDFS".

Example 2-1 illustrates these steps.

Example 2-1 Accessing HDFS Data Files from Oracle Database

$ cat moviefact_hdfs.sh
# Add environment variables
export OSCH_HOME="/u01/connectors/orahdfs-3.0.0"
 
hadoop jar $OSCH_HOME/jlib/orahdfs.jar \
       oracle.hadoop.exttab.ExternalTable \
       -conf /home/oracle/movies/moviefact_hdfs.xml \
       -createTable

$ cat moviefact_hdfs.xml
<?xml version="1.0"?>
 <configuration>
    <property>
      <name>oracle.hadoop.exttab.tableName</name>
      <value>MOVIE_FACTS_EXT</value>
    </property>
    <property>
      <name>oracle.hadoop.exttab.locationFileCount</name>
      <value>4</value>
    </property>
    <property>
      <name>oracle.hadoop.exttab.dataPaths</name>
      <value>/user/oracle/moviework/data/part*</value>
    </property>
    <property>
      <name>oracle.hadoop.exttab.fieldTerminator</name>
      <value>\u0009</value>
    </property>
    <property>
      <name>oracle.hadoop.exttab.defaultDirectory</name>
      <value>MOVIEDEMO_DIR</value>
    </property>
    <property>
      <name>oracle.hadoop.exttab.columnNames</name>
      <value>CUST_ID,MOVIE_ID,GENRE_ID,TIME_ID,RECOMMENDED,ACTIVITY_ID,RATING,SALES</value>
    </property>
    <property>
      <name>oracle.hadoop.exttab.colMap.TIME_ID.columnType</name>
      <value>TIMESTAMP</value>
    </property>
    <property>
      <name>oracle.hadoop.exttab.colMap.timestampMask</name>
      <value>YYYY-MM-DD:HH:MI:SS</value>
    </property>
    <property>
      <name>oracle.hadoop.exttab.colMap.RECOMMENDED.columnType</name>
      <value>NUMBER</value>
    </property>
    <property>
      <name>oracle.hadoop.exttab.colMap.ACTIVITY_ID.columnType</name>
      <value>NUMBER</value>
    </property>
    <property>
      <name>oracle.hadoop.exttab.colMap.RATING.columnType</name>
      <value>NUMBER</value>
    </property>
    <property>
      <name>oracle.hadoop.exttab.colMap.SALES.columnType</name>
      <value>NUMBER</value>
    </property>
    <property>
      <name>oracle.hadoop.exttab.sourceType</name>
      <value>text</value>
    </property>
    <property>
      <name>oracle.hadoop.connection.url</name>
      <value>jdbc:oracle:thin:@localhost:1521:orcl</value>
    </property>
    <property>
      <name>oracle.hadoop.connection.user</name>
      <value>MOVIEDEMO</value>
    </property>  
</configuration>

$ sh moviefact_hdfs.sh
Oracle SQL Connector for HDFS Release 3.0.0 - Production
 
Copyright (c) 2011, 2014, Oracle and/or its affiliates. All rights reserved.
 
[Enter Database Password: password]
The create table command succeeded.
 
CREATE TABLE "MOVIEDEMO"."MOVIE_FACTS_EXT"
(
 "CUST_ID"                        VARCHAR2(4000),
 "MOVIE_ID"                       VARCHAR2(4000),
 "GENRE_ID"                       VARCHAR2(4000),
 "TIME_ID"                        TIMESTAMP(9),
 "RECOMMENDED"                    NUMBER,
 "ACTIVITY_ID"                    NUMBER,
 "RATING"                         NUMBER,
 "SALES"                          NUMBER
)
ORGANIZATION EXTERNAL
( 
   TYPE ORACLE_LOADER
   DEFAULT DIRECTORY "MOVIEDEMO_DIR"
   ACCESS PARAMETERS
   (
     RECORDS DELIMITED BY 0X'0A'
     CHARACTERSET AL32UTF8
     PREPROCESSOR "OSCH_BIN_PATH":'hdfs_stream'
     FIELDS TERMINATED BY 0X'09'
     MISSING FIELD VALUES ARE NULL
     (
       "CUST_ID" CHAR(4000),
       "MOVIE_ID" CHAR(4000),
       "GENRE_ID" CHAR(4000),
       "TIME_ID" CHAR,
       "RECOMMENDED" CHAR,
       "ACTIVITY_ID" CHAR,
       "RATING" CHAR,
       "SALES" CHAR
     )
   )
   LOCATION
   (
     'osch-20141114064206-5250-1',
     'osch-20141114064206-5250-2',
     'osch-20141114064206-5250-3',
     'osch-20141114064206-5250-4'
   )
) PARALLEL REJECT LIMIT UNLIMITED;
 
The following location files were created.
 
osch-20141114064206-5250-1 contains 1 URI, 12754882 bytes
 
    12754882 hdfs://localhost.localdomain:8020/user/oracle/moviework/data/part-00001
 
osch-20141114064206-5250-2 contains 1 URI, 438 bytes
 
         438 hdfs://localhost.localdomain:8020/user/oracle/moviework/data/part-00002
 
osch-20141114064206-5250-3 contains 1 URI, 432 bytes
 
         432 hdfs://localhost.localdomain:8020/user/oracle/moviework/data/part-00003
 
osch-20141114064206-5250-4 contains 1 URI, 202 bytes
 
         202 hdfs://localhost.localdomain:8020/user/oracle/moviework/data/part-00004

$ sqlplus moviedemo
 
SQL*Plus: Release 12.1.0.1.0 Production on Fri Apr 18 09:24:18 2014
 
Copyright (c) 1982, 2013, Oracle.  All rights reserved.
 
Enter password: password
Last Successful login time: Thu Apr 17 2014 18:42:01 -05:00
 
Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.1.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options
 
SQL> DESCRIBE movie_facts_ext;
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 CUST_ID                                            VARCHAR2(4000)
 MOVIE_ID                                           VARCHAR2(4000)
 GENRE_ID                                           VARCHAR2(4000)
 TIME_ID                                            TIMESTAMP(9)
 RECOMMENDED                                        NUMBER
 ACTIVITY_ID                                        NUMBER
 RATING                                             NUMBER
 SALES                                              NUMBER

SQL> CREATE TABLE movie_facts AS SELECT * FROM movie_facts_ext;
 
Table created.

SQL> SELECT movie_id, time_id, recommended, rating FROM movie_facts WHERE rownum < 5;
 
MOVIE_ID TIME_ID                          RECOMMENDED     RATING
-------- -------------------------------- ----------- ----------
205      03-DEC-10 03.14.54.000000000 AM            1          1
77       14-AUG-11 10.46.55.000000000 AM            1          3
116      24-NOV-11 05.43.00.000000000 AM            1          5
141      01-JAN-11 05.17.57.000000000 AM            1          4

2.3 Configuring Your System for Oracle SQL Connector for HDFS

You can run Oracle SQL Connector for HDFS on either the Oracle Database system or the Hadoop cluster:

  • For Hive sources, log in to either a node in the Hadoop cluster or a system set up as a Hadoop client for the cluster.

  • For text and Data Pump format files, log in to either the Oracle Database system or a node in the Hadoop cluster.

Oracle SQL Connector for HDFS requires additions to the HADOOP_CLASSPATH environment variable on the system where you log in. Your system administrator may have set them up for you when creating your account, or may have left that task for you. See "Setting Up User Accounts on the Oracle Database System".

Setting up the environment variables: 

  • Verify that HADOOP_CLASSPATH includes the path to the JAR files for Oracle SQL Connector for HDFS:

    path/orahdfs-3.0.0/jlib/*
    
  • If you are logged in to a Hadoop cluster with Hive data sources, then verify that HADOOP_CLASSPATH also includes the Hive JAR files and conf directory. For example:

    /usr/lib/hive/lib/*
    /etc/hive/conf
    
  • For your convenience, you can create an OSCH_HOME environment variable. The following is the Bash command for setting it on Oracle Big Data Appliance:

    $ export OSCH_HOME="/opt/oracle/orahdfs-3.0.0"
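
Putting these settings together, the environment on a Hadoop cluster node with Hive data sources might be set up as follows (the paths are examples; adjust them to your installation):

# Example paths only; adjust to your installation
$ export OSCH_HOME="/opt/oracle/orahdfs-3.0.0"
$ export HADOOP_CLASSPATH="$HADOOP_CLASSPATH:$OSCH_HOME/jlib/*:/usr/lib/hive/lib/*:/etc/hive/conf"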
    

2.4 Using the ExternalTable Command-Line Tool

Oracle SQL Connector for HDFS provides a command-line tool named ExternalTable. This section describes the basic use of this tool. See "Creating External Tables" for the command syntax that is specific to your data source format.

2.4.1 About ExternalTable

The ExternalTable tool uses the values of several properties to do the following tasks:

  • Create an external table

  • Populate the location files

  • Publish location files to an existing external table

  • List the location files

  • Describe an external table

You can specify these property values in an XML document or individually on the command line. See "Configuring Oracle SQL Connector for HDFS".

2.4.2 ExternalTable Command-Line Tool Syntax

This is the full syntax of the ExternalTable command-line tool, which is run using the hadoop command:

hadoop jar OSCH_HOME/jlib/orahdfs.jar \
oracle.hadoop.exttab.ExternalTable \
[-conf config_file]... \ 
[-D property=value]... \
-createTable [--noexecute [--output filename.sql]] 
  | -drop [--noexecute]
  | -describe
  | -publish [--noexecute] 
  | -listLocations [--details]
  | -getDDL

You can either create the OSCH_HOME environment variable or replace OSCH_HOME in the command syntax with the full path to the installation directory for Oracle SQL Connector for HDFS. On Oracle Big Data Appliance, this directory is:

/opt/oracle/orahdfs-version

For example, you might run the ExternalTable command-line tool with a command like this:

hadoop jar /opt/oracle/orahdfs-3.0.0/jlib/orahdfs.jar \
oracle.hadoop.exttab.ExternalTable \
     .
     .
     .

Generic Options and User Commands 

-conf config_file

Identifies the name of an XML configuration file containing properties needed by the command being executed. See "Configuring Oracle SQL Connector for HDFS."

-D property=value

Assigns a value to a specific property.

-createTable [--noexecute [--output filename]]

Creates an external table definition and publishes the data URIs to the location files of the external table. The output report shows the DDL used to create the external table and lists the contents of the location files. Oracle SQL Connector for HDFS also checks the database to ensure that the required database directories exist and that you have the necessary permissions.

For partitioned Hive tables, Oracle SQL Connector for HDFS creates external tables, views, and a metadata table. See Table 2-2.

Use the --noexecute option to see the execution plan of the command. The operation is not executed, but the report includes the details of the execution plan and any errors. The --output option writes the table DDL from the -createTable command to a file. Oracle recommends that you first execute a -createTable command with --noexecute.
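
For example, the following command previews the DDL and writes it to a file without creating the table (the configuration file reuses Example 2-1; the output file name is illustrative):

# Preview only; no table is created
hadoop jar $OSCH_HOME/jlib/orahdfs.jar \
       oracle.hadoop.exttab.ExternalTable \
       -conf /home/oracle/movies/moviefact_hdfs.xml \
       -createTable --noexecute --output /home/oracle/movies/moviefact_ext.sql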

-drop [--noexecute]

Deletes one or more Oracle Database objects created by Oracle SQL Connector for HDFS to support a particular data source. Specify the metadata table name for partitioned Hive tables, or the external table name for all other data sources. An error occurs if you attempt to drop a table or view that Oracle SQL Connector for HDFS did not create.

Use the --noexecute option to list the objects to be deleted.

-describe

Provides information about the Oracle Database objects created by Oracle SQL Connector for HDFS. Use this command instead of -getDDL or -listLocations.

-publish [--noexecute]

Publishes the data URIs to the location files of an existing external table. Use this command after adding new data files, so that the existing external table can access them.

Use the --noexecute option to see the execution plan of the command. The operation is not executed, but the report shows the planned SQL ALTER TABLE command and location files. The report also shows any errors.

Oracle recommends that you first execute a -publish command with --noexecute.

See "Publishing the HDFS Data Paths."

-listLocations [--details]

Shows the location file content as text. With the --details option, this command provides a detailed listing. This command is deprecated in release 3.0. Use -describe instead.

-getDDL

Prints the table definition of an existing external table. This command is deprecated in release 3.0. Use -describe instead.

2.5 Creating External Tables

You can create external tables automatically using the ExternalTable tool provided in Oracle SQL Connector for HDFS.

2.5.1 Creating External Tables with the ExternalTable Tool

To create an external table using the ExternalTable tool, follow the instructions for your data source:

When the ExternalTable -createTable command finishes executing, the external table is ready for use. ExternalTable also manages the location files for the external table. See "Location File Management."

To create external tables manually, follow the instructions in "Creating External Tables in SQL."

ExternalTable Syntax for -createTable

Use the following syntax to create an external table and populate its location files:

hadoop jar OSCH_HOME/jlib/orahdfs.jar oracle.hadoop.exttab.ExternalTable \
[-conf config_file]... \
[-D property=value]... \
-createTable [--noexecute]

2.5.2 Creating External Tables from Data Pump Format Files

Oracle SQL Connector for HDFS supports only Data Pump files produced by Oracle Loader for Hadoop, and does not support generic Data Pump files produced by Oracle Utilities.

Oracle SQL Connector for HDFS creates the external table definition for Data Pump files by using the metadata from the Data Pump file header. It uses the ORACLE_LOADER access driver with the preprocessor access parameter. It also uses a special access parameter named EXTERNAL VARIABLE DATA, which enables ORACLE_LOADER to read the Data Pump format files generated by Oracle Loader for Hadoop.
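
As a rough sketch (the exact DDL is what -createTable generates), the access parameters clause of a Data Pump external table resembles the following, in place of the delimited-text parameters shown in Example 2-1:

-- Sketch only; the actual DDL is generated by -createTable
ACCESS PARAMETERS
(
  EXTERNAL VARIABLE DATA
  PREPROCESSOR "OSCH_BIN_PATH":'hdfs_stream'
)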

To delete the external tables and location files created by Oracle SQL Connector for HDFS, use the -drop command. See "Dropping Database Objects Created by Oracle SQL Connector for HDFS".

Note:

Oracle SQL Connector for HDFS requires a patch to Oracle Database 11.2.0.2 before the connector can access Data Pump files produced by Oracle Loader for Hadoop. To download this patch, go to http://support.oracle.com and search for bug 14557588.

Release 11.2.0.3 and later releases do not require this patch.

2.5.2.2 Optional Properties

This property is optional:

2.5.2.3 Defining Properties in XML Files for Data Pump Format Files

Example 2-2 is an XML template containing the properties that describe a Data Pump file. To use the template, cut and paste it into a text file, enter the appropriate values to describe your Data Pump file, and delete any optional properties that you do not need. For more information about using XML templates, see "Creating a Configuration File."

Example 2-2 XML Template with Properties for a Data Pump Format File

<?xml version="1.0"?>

<!-- Required Properties -->

<configuration>
  <property>
    <name>oracle.hadoop.exttab.tableName</name>
    <value>value</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.defaultDirectory</name>
    <value>value</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.dataPaths</name>
    <value>value</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.sourceType</name>
    <value>datapump</value>
  </property>
  <property>
    <name>oracle.hadoop.connection.url</name>
    <value>value</value>
  </property>
  <property>
    <name>oracle.hadoop.connection.user</name>
    <value>value</value>
  </property>

<!-- Optional Properties -->

  <property>
    <name>oracle.hadoop.exttab.logDirectory</name>
    <value>value</value>
  </property>
</configuration>

2.5.2.4 Example

Example 2-3 creates an external table named SALES_DP_XTAB to read Data Pump files.

Example 2-3 Defining an External Table for Data Pump Format Files

Log in as the operating system user that Oracle Database runs under (typically the oracle user), and create a file-system directory. For Oracle RAC, you must create a clusterwide directory on a distributed file system.

$ mkdir /data/sales_dp_dir

Create a database directory and grant read and write access to it:

$ sqlplus / as sysdba
SQL> CREATE OR REPLACE DIRECTORY sales_dp_dir AS '/data/sales_dp_dir';
SQL> GRANT READ, WRITE ON DIRECTORY sales_dp_dir TO scott;

Create the external table:

$ export OSCH_HOME="/opt/oracle/orahdfs-3.0.0"
$ export HADOOP_CLASSPATH="$HADOOP_CLASSPATH:$OSCH_HOME/jlib/*"
$ hadoop jar $OSCH_HOME/jlib/orahdfs.jar \
oracle.hadoop.exttab.ExternalTable \
-D oracle.hadoop.exttab.tableName=SALES_DP_XTAB \
-D oracle.hadoop.exttab.sourceType=datapump \
-D oracle.hadoop.exttab.dataPaths=hdfs:///user/scott/olh_sales_dpoutput/ \
-D oracle.hadoop.exttab.defaultDirectory=SALES_DP_DIR \
-D oracle.hadoop.connection.url=jdbc:oracle:thin:@//myhost:1521/myservicename \
-D oracle.hadoop.connection.user=SCOTT \
-createTable

2.5.3 Creating External Tables from Hive Tables

Oracle SQL Connector for HDFS creates the external table definition from a Hive table by contacting the Hive metastore client to retrieve information about the table columns and the location of the table data. In addition, the Hive table data paths are published to the location files of the Oracle external table.

To read Hive table metadata, Oracle SQL Connector for HDFS must be installed and running on a computer with a correctly functioning Hive client, and both the Hive JAR files and the Hive configuration directory must be included in the HADOOP_CLASSPATH environment variable.

For Hive managed tables, the data paths come from the warehouse directory.

For Hive external tables, the data paths from an external location in HDFS are published to the location files of the Oracle external table. A Hive external table can have no data, because Hive does not verify that the external location exists when the table is created. If the Hive table is empty, then one location file is published with just a header and no data URIs.

The Oracle external table is not a "live" Hive table. After changes are made to a Hive table, you must use the ExternalTable tool to drop the existing external table and create a new one.

To delete the external tables and location files created by Oracle SQL Connector for HDFS, use the -drop command. See "Dropping Database Objects Created by Oracle SQL Connector for HDFS".

2.5.3.1 Hive Table Requirements

Oracle SQL Connector for HDFS supports nonpartitioned Hive tables that are defined using ROW FORMAT DELIMITED and FILE FORMAT TEXTFILE clauses. Both Hive-managed tables and Hive external tables are supported.

Oracle SQL Connector for HDFS also supports partitioned tables. See "Creating External Tables from Partitioned Hive Tables".

Hive tables can be bucketed or unbucketed. All primitive data types from Hive 0.10.0 are supported.
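
For reference, a Hive table that meets these requirements could be defined with DDL like the following (the table name matches Example 2-5; the columns are illustrative only):

-- Illustrative columns; ROW FORMAT DELIMITED and STORED AS TEXTFILE are the required clauses
CREATE TABLE sales_country_us (
  cust_id  STRING,
  movie_id STRING,
  sales    DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;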

2.5.3.2 Data Type Mappings

Table 2-1 shows the default data-type mappings between Hive and Oracle. To change the data type of the target columns created in the Oracle external table, set the oracle.hadoop.exttab.hive.columnType.* properties listed under "Optional Properties".

Table 2-1 Hive Data Type Mappings

Data Type of Source Hive Column    Default Data Type of Target Oracle Column
---------------------------------  ------------------------------------------
INT, BIGINT, SMALLINT, TINYINT     INTEGER
DECIMAL                            NUMBER
DECIMAL(p,s)                       NUMBER(p,s)
DOUBLE, FLOAT                      NUMBER
DATE                               DATE with format mask YYYY-MM-DD
TIMESTAMP                          TIMESTAMP with format mask
                                   YYYY-MM-DD HH24:MI:SS.FF
BOOLEAN                            VARCHAR2(5)
CHAR(size)                         CHAR(size)
STRING                             VARCHAR2(4000)
VARCHAR                            VARCHAR2(4000)
VARCHAR(size)                      VARCHAR2(size)


2.5.3.5 Defining Properties in XML Files for Hive Tables

Example 2-4 is an XML template containing the properties that describe a Hive table. To use the template, cut and paste it into a text file, enter the appropriate values to describe your Hive table, and delete any optional properties that you do not need. For more information about using XML templates, see "Creating a Configuration File."

Example 2-4 XML Template with Properties for a Hive Table

<?xml version="1.0"?>

<!-- Required Properties -->

<configuration>
  <property>
    <name>oracle.hadoop.exttab.tableName</name>
    <value>value</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.defaultDirectory</name>
    <value>value</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.sourceType</name>
    <value>hive</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.hive.partitionFilter</name>
    <value>value</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.hive.tableName</name>
    <value>value</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.hive.databaseName</name>
    <value>value</value>
  </property>
  <property>
    <name>oracle.hadoop.connection.url</name>
    <value>value</value>
  </property>
  <property>
    <name>oracle.hadoop.connection.user</name>
    <value>value</value>
  </property>

<!-- Optional Properties -->

  <property>
    <name>oracle.hadoop.exttab.locationFileCount</name>
    <value>value</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.hive.columnType.TYPE</name>
    <value>value</value>
  </property>
</configuration>

2.5.3.6 Example

Example 2-5 creates an external table named SALES_HIVE_XTAB to read data from a Hive table. The example defines all the properties on the command line instead of in an XML file.

Example 2-5 Defining an External Table for a Hive Table

Log in as the operating system user that Oracle Database runs under (typically the oracle user), and create a file-system directory:

$ mkdir /data/sales_hive_dir

Create a database directory and grant read and write access to it:

$ sqlplus / as sysdba
SQL> CREATE OR REPLACE DIRECTORY sales_hive_dir AS '/data/sales_hive_dir';
SQL> GRANT READ, WRITE ON DIRECTORY sales_hive_dir TO scott;

Create the external table:

$ export OSCH_HOME="/opt/oracle/orahdfs-3.0.0"
$ export HADOOP_CLASSPATH="$HADOOP_CLASSPATH:$OSCH_HOME/jlib/*:/usr/lib/hive/lib/*:/etc/hive/conf"

$ hadoop jar $OSCH_HOME/jlib/orahdfs.jar \
oracle.hadoop.exttab.ExternalTable \
-D oracle.hadoop.exttab.tableName=SALES_HIVE_XTAB \
-D oracle.hadoop.exttab.sourceType=hive \
-D oracle.hadoop.exttab.locationFileCount=2 \
-D oracle.hadoop.exttab.hive.tableName=sales_country_us \
-D oracle.hadoop.exttab.hive.databaseName=salesdb \
-D oracle.hadoop.exttab.defaultDirectory=SALES_HIVE_DIR \
-D oracle.hadoop.connection.url=jdbc:oracle:thin:@//myhost:1521/myservicename \
-D oracle.hadoop.connection.user=SCOTT \
-createTable

2.5.4 Creating External Tables from Partitioned Hive Tables

Oracle SQL Connector for HDFS supports partitioned Hive tables, enabling you to query a single partition, a range of partitions, or all partitions. You can represent all Hive partitions or a subset of them in Oracle Database.

See Also:

"Creating External Tables from Hive Tables" for required properties, data type mappings, and other details applicable to all Hive table access using Oracle SQL Connector for HDFS.

2.5.4.1 Database Objects that Support Access to Partitioned Hive Tables

To support a partitioned Hive table, Oracle SQL Connector for HDFS creates the objects described in Table 2-2.

Table 2-2 Oracle Database Objects for Supporting a Partitioned Hive Table

Database Object  Description                                           Naming Convention
---------------  ----------------------------------------------------  --------------------------------
External tables  One for each Hive partition.                          OSCHtable_name_n
                                                                        For example, OSCHDAILY_1 and
                                                                        OSCHDAILY_2
Views            One for each external table. Used for querying the    table_name_n
                 Hive data.                                             For example, DAILY_1 and DAILY_2
Metadata table   One for the Hive table. Identifies all external       table_name
                 tables and views associated with a particular Hive    For example, DAILY
                 table. Specify this table when creating, describing,
                 or dropping these database objects.


For example, if a Hive table comprises five partitions, then Oracle SQL Connector for HDFS creates five external tables, five views, and one metadata table in Oracle Database.

To drop the objects described in Table 2-2 and the location files, use the -drop command. See "Dropping Database Objects Created by Oracle SQL Connector for HDFS".

2.5.4.2 Querying the Metadata Table

The metadata table provides critical information about how to query the Hive table. Table 2-3 describes the columns of a metadata table.

Table 2-3 Metadata Table Columns

Column             Description
-----------------  -------------------------------------------------------------------
VIEW_NAME          The view in Oracle Database used to access a partition of a Hive
                   table.
EXT_TABLE_NAME     The external table in Oracle Database used to access the data in a
                   Hive partition.
HIVE_TABLE_NAME    The partitioned Hive table being accessed through Oracle Database.
HIVE_DB_NAME       The Hive database where the table resides.
HIVE_PART_FILTER   The Hive partition filter used to select a subset of partitions for
                   access by Oracle Database. A NULL value indicates that all
                   partitions are accessed.
Partition Columns  Each column used to partition the Hive table has a separate column
                   in the metadata table. For example, the metadata table has columns
                   for COUNTRY, STATE, and CITY for a Hive table partitioned by a
                   combination of COUNTRY, STATE, and CITY values.


The following SELECT statement queries a metadata table named HIVE_SALES_DATA:

SQL> SELECT view_name, ext_table_name, hive_table_name,
            hive_db_name, country, city
     FROM hive_sales_data
     WHERE state = 'TEXAS';

The results of the query identify three views with data from cities in Texas:

VIEW_NAME          EXT_TABLE_NAME         HIVE_TABLE_NAME   HIVE_DB_NAME  COUNTRY  CITY 
------------------------------------------------------------------------------------------
HIVE_SALES_DATA_1  OSCHHIVE_SALES_DATA_1  hive_sales_data   db_sales      US       AUSTIN
HIVE_SALES_DATA_2  OSCHHIVE_SALES_DATA_2  hive_sales_data   db_sales      US       HOUSTON
HIVE_SALES_DATA_3  OSCHHIVE_SALES_DATA_3  hive_sales_data   db_sales      US       DALLAS

2.5.4.3 Creating UNION ALL Views for Querying

To facilitate querying, you can create UNION ALL views over the individual partition views. Use the mkhive_unionall_view.sql script, which is provided in the OSCH_HOME/example/sql directory. To maintain performance, do not create UNION ALL views over more than 50 to 100 views (depending on their size).

To use mkhive_unionall_view.sql, use the following syntax:

@mkhive_unionall_view[.sql] table schema view predicate

MKHIVE_UNIONALL_VIEW Script Parameters 

table

The name of the metadata table in Oracle Database that represents a partitioned Hive table. Required.

schema

The owner of the metadata table. Optional; defaults to your schema.

view

The name of the UNION ALL view created by the script. Optional; defaults to table_ua.

predicate

A WHERE condition used to select the partitions in the Hive table to include in the UNION ALL view. Optional; defaults to all partitions.

Example 2-6 Union All Views for Partitioned Hive Tables

The following example creates a UNION ALL view named HIVE_SALES_DATA_UA, which accesses all partitions listed in the HIVE_SALES_DATA metadata table:

SQL> @mkhive_unionall_view.sql HIVE_SALES_DATA null null null
 

This example creates a UNION ALL view named ALL_SALES, which accesses all partitions listed in the HIVE_SALES_DATA metadata table:

SQL> @mkhive_unionall_view.sql HIVE_SALES_DATA null ALL_SALES null
 

The next example creates a UNION ALL view named TEXAS_SALES_DATA, which accesses the rows of all partitions where STATE = 'TEXAS'.

SQL> @mkhive_unionall_view.sql HIVE_SALES_DATA null TEXAS_SALES_DATA '(STATE = ''''TEXAS'''')'

2.5.4.4 Error Messages

table name too long, max limit length

Cause: The names generated for the database objects exceed 30 characters.

Action: Specify a name that does not exceed 24 characters in the oracle.hadoop.exttab.tableName property. Oracle SQL Connector for HDFS generates external table names using the convention OSCHtable_name_n. See Table 2-2.

table/view names containing string table_name found in schema schema_name

Cause: An attempt was made to create external tables for a partitioned Hive table, but the database objects already exist.

Action: Use the -drop command to drop the existing tables and views, and then retry the -createTable command. If this solution fails, then you might have "dangling" objects. See "Dropping Dangling Objects".

2.5.4.5 Dropping Dangling Objects

Always use Oracle SQL Connector for HDFS commands to manage the objects that the connector creates to support partitioned Hive tables. Dangling objects occur when you use the SQL DROP TABLE statement to drop a metadata table instead of using the -drop command. If you cannot drop the external tables and views for a partitioned Hive table, then they are dangling objects.

Notice the schema and table names in the error message generated when you attempted to drop the objects, and use them in the following procedure.

To drop dangling database objects: 

  1. Open a SQL session with Oracle Database, and connect as the owner of the dangling objects.

  2. Identify the location files of the external table by querying the ALL_EXTERNAL_LOCATIONS and ALL_EXTERNAL_TABLES data dictionary views:

    SELECT a.table_name, a.directory_name, a.location
    FROM all_external_locations a, all_external_tables b
    WHERE a.table_name = b.table_name AND a.table_name
    LIKE 'OSCHtable%' AND a.owner='schema';
     
    

    In the LIKE clause of the previous syntax, replace table and schema with the appropriate values.

    In the output, the location file names have an osch- prefix, such as osch-20140408014604-175-1.

  3. Identify the external tables by querying the ALL_EXTERNAL_TABLES data dictionary view:

    SELECT table_name FROM all_external_tables
    WHERE table_name
    LIKE 'OSCHtable%' AND owner='schema';
     
    
  4. Identify the database views by querying the ALL_VIEWS data dictionary view:

    SELECT view_name FROM all_views 
    WHERE view_name 
    LIKE 'table%' AND owner='schema';
     
    
  5. Inspect the tables, views, and location files to verify that they are not needed, using commands like the following:

    DESCRIBE schema.table;
    SELECT * FROM schema.table;
    
    DESCRIBE schema.view;
    SELECT * FROM schema.view;
    
  6. Delete the location files, tables, and views that are not needed, using commands like the following:

    EXECUTE utl_file.fremove('directory', 'location_file');
    
    DROP TABLE schema.table;
    DROP VIEW schema.view;
    

2.5.5 Creating External Tables from Delimited Text Files

Oracle SQL Connector for HDFS creates the external table definition for delimited text files using configuration properties that specify the number of columns, the text delimiter, and optionally, the external table column names. By default, all text columns in the external table are VARCHAR2. If column names are not provided, they default to C1 to Cn, where n is the number of columns specified by the oracle.hadoop.exttab.columnCount property.

2.5.5.1 Data Type Mappings

All text data sources are automatically mapped to VARCHAR2(4000). To change the data type of the target columns created in the Oracle external table, set the oracle.hadoop.exttab.colMap.* properties listed under "Optional Properties."

2.5.5.4 Defining Properties in XML Files for Delimited Text Files

Example 2-7 is an XML template containing all the properties that describe a delimited text file. To use the template, cut and paste it into a text file, enter the appropriate values to describe your data files, and delete any optional properties that you do not need. For more information about using XML templates, see "Creating a Configuration File."

Example 2-7 XML Template with Properties for a Delimited Text File

<?xml version="1.0"?>

<!-- Required Properties -->

<configuration>
  <property>
    <name>oracle.hadoop.exttab.tableName</name>
    <value>value</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.defaultDirectory</name>
    <value>value</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.dataPaths</name>
    <value>value</value>
  </property>

<!-- Use either columnCount or columnNames -->

  <property>
    <name>oracle.hadoop.exttab.columnCount</name>
    <value>value</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.columnNames</name>
    <value>value</value>
  </property>

  <property>
    <name>oracle.hadoop.connection.url</name>
    <value>value</value>
  </property>
  <property>
    <name>oracle.hadoop.connection.user</name>
    <value>value</value>
  </property>

<!-- Optional Properties -->

  <property>
    <name>oracle.hadoop.exttab.colMap.TYPE</name>
    <value>value</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.recordDelimiter</name>
    <value>value</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.fieldTerminator</name>
    <value>value</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.initialFieldEncloser</name>
    <value>value</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.trailingFieldEncloser</name>
    <value>value</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.locationFileCount</name>
    <value>value</value>
  </property>
</configuration>

2.5.5.5 Example

Example 2-8 creates an external table named SALES_DT_XTAB from delimited text files.

Example 2-8 Defining an External Table for Delimited Text Files

Log in as the operating system user that Oracle Database runs under (typically the oracle user), and create a file-system directory:

$ mkdir /data/sales_dt_dir

Create a database directory and grant read and write access to it:

$ sqlplus / as sysdba
SQL> CREATE OR REPLACE DIRECTORY sales_dt_dir AS '/data/sales_dt_dir';
SQL> GRANT READ, WRITE ON DIRECTORY sales_dt_dir TO scott;

Create the external table:

$ export OSCH_HOME="/opt/oracle/orahdfs-3.0.0"
$ export HADOOP_CLASSPATH="$HADOOP_CLASSPATH:$OSCH_HOME/jlib/*"

$ hadoop jar $OSCH_HOME/jlib/orahdfs.jar \
oracle.hadoop.exttab.ExternalTable \
-D oracle.hadoop.exttab.tableName=SALES_DT_XTAB \
-D oracle.hadoop.exttab.locationFileCount=2 \
-D oracle.hadoop.exttab.dataPaths="hdfs:///user/scott/olh_sales/*.dat" \
-D oracle.hadoop.exttab.columnCount=10 \
-D oracle.hadoop.exttab.defaultDirectory=SALES_DT_DIR \
-D oracle.hadoop.connection.url=jdbc:oracle:thin:@//myhost:1521/myservicename \
-D oracle.hadoop.connection.user=SCOTT \
-createTable

2.5.6 Creating External Tables in SQL

You can create an external table manually for Oracle SQL Connector for HDFS. For example, the following procedure enables you to use external table syntax that is not exposed by the ExternalTable -createTable command.

Additional syntax might not be supported for Data Pump format files.

To create an external table manually:  

  1. Use the -createTable --noexecute command to generate the external table DDL.

  2. Make whatever changes are needed to the DDL.

  3. Run the DDL from Step 2 to create the table definition in the Oracle database.

  4. Use the ExternalTable -publish command to publish the data URIs to the location files of the external table. (A sketch of this workflow follows the procedure.)
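
The following sketch ties the steps together for a delimited-text source, assuming a hypothetical configuration file sales_dt.xml and output file sales_dt_xtab.sql (both names are illustrative):

# Step 1: generate the external table DDL without executing it
hadoop jar $OSCH_HOME/jlib/orahdfs.jar \
       oracle.hadoop.exttab.ExternalTable \
       -conf /home/oracle/sales_dt.xml \
       -createTable --noexecute --output /home/oracle/sales_dt_xtab.sql

# Steps 2 and 3: edit sales_dt_xtab.sql as needed, then run it in SQL*Plus:
#   SQL> @/home/oracle/sales_dt_xtab.sql

# Step 4: publish the HDFS data URIs to the location files
hadoop jar $OSCH_HOME/jlib/orahdfs.jar \
       oracle.hadoop.exttab.ExternalTable \
       -conf /home/oracle/sales_dt.xml \
       -publish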

2.6 Publishing the HDFS Data Paths

The -createTable command creates the metadata in Oracle Database for delimited text and Data Pump sources, and populates the location files with the Universal Resource Identifiers (URIs) of the data files in HDFS. You might publish the URIs as a separate step from creating the external table in cases like these:

  • You want to publish new data into an already existing external table.

  • You created the external table manually instead of using the ExternalTable tool.

In both cases, you can use ExternalTable with the -publish command to populate the external table location files with the URIs of the data files in HDFS. See "Location File Management".

Note:

The publish option is supported for delimited text and Data Pump sources. It is not supported for Hive sources. Use the -drop and -createTable commands of the ExternalTable tool for Hive sources.

2.6.1 ExternalTable Syntax for Publish

hadoop jar OSCH_HOME/jlib/orahdfs.jar \
oracle.hadoop.exttab.ExternalTable \
[-conf config_file]... \
[-D property=value]... \
-publish [--noexecute]

2.6.2 ExternalTable Example for Publish

Example 2-9 sets HADOOP_CLASSPATH and publishes the HDFS data paths to the external table created in Example 2-3. See "Configuring Your System for Oracle SQL Connector for HDFS" for more information about setting this environment variable.

Example 2-9 Publishing HDFS Data Paths to an External Table for Data Pump Format Files

This example uses the Bash shell.

$ export HADOOP_CLASSPATH="OSCH_HOME/jlib/*"
$ hadoop jar OSCH_HOME/jlib/orahdfs.jar oracle.hadoop.exttab.ExternalTable \
-D oracle.hadoop.exttab.tableName=SALES_DP_XTAB \
-D oracle.hadoop.exttab.sourceType=datapump \
-D oracle.hadoop.exttab.dataPaths=hdfs:/user/scott/data/ \
-D oracle.hadoop.connection.url=jdbc:oracle:thin:@//myhost:1521/myservicename \
-D oracle.hadoop.connection.user=scott -publish

In this example:

  • OSCH_HOME is the full path to the Oracle SQL Connector for HDFS installation directory.

  • SALES_DP_XTAB is the external table created in Example 2-3.

  • hdfs:/user/scott/data/ is the location of the HDFS data.

  • jdbc:oracle:thin:@//myhost:1521/myservicename is the database connection URL.

2.7 Exploring External Tables and Location Files

The -describe command is a debugging and diagnostic utility that prints the definition of an existing external table. It also enables you to see the location file metadata and contents. You can use this command to verify the integrity of the location files of an Oracle external table.

These properties are required to use this command:

2.7.1 ExternalTable Syntax for Describe

hadoop jar OSCH_HOME/jlib/orahdfs.jar \
oracle.hadoop.exttab.ExternalTable \
[-conf config_file]... \
[-D property=value]... \
-describe

2.7.2 ExternalTable Example for Describe

Example 2-10 shows the command syntax to describe the external tables and location files associated with SALES_DP_XTAB.

Example 2-10 Exploring External Tables and Location Files

$ export HADOOP_CLASSPATH="OSCH_HOME/jlib/*"
$ hadoop jar OSCH_HOME/jlib/orahdfs.jar oracle.hadoop.exttab.ExternalTable \
-D oracle.hadoop.exttab.tableName=SALES_DP_XTAB \
-D oracle.hadoop.connection.url=jdbc:oracle:thin:@//myhost:1521/myservicename \
-D oracle.hadoop.connection.user=scott -describe

2.8 Dropping Database Objects Created by Oracle SQL Connector for HDFS

The -drop command deletes the database objects created by Oracle SQL Connector for HDFS. These objects include external tables, location files, and views. For a partitioned Hive table, always use -drop rather than deleting the objects manually, which causes the problems described in "Dropping Dangling Objects". The -drop command also deletes the single external table and location files that support the other data source types.

The -drop command only deletes objects created by Oracle SQL Connector for HDFS.

These properties are required to use this command:

2.8.1 ExternalTable Syntax for Drop

hadoop jar OSCH_HOME/jlib/orahdfs.jar \
oracle.hadoop.exttab.ExternalTable \
[-conf config_file]... \
[-D property=value]... \
-drop

2.8.2 ExternalTable Example for Drop

Example 2-11 shows the command syntax to drop the database objects associated with SALES_DP_XTAB.

Example 2-11 Dropping Database Objects

$ export HADOOP_CLASSPATH="OSCH_HOME/jlib/*"
$ hadoop jar OSCH_HOME/jlib/orahdfs.jar oracle.hadoop.exttab.ExternalTable \
-D oracle.hadoop.exttab.tableName=SALES_DP_XTAB \
-D oracle.hadoop.connection.url=jdbc:oracle:thin:@//myhost:1521/myservicename \
-D oracle.hadoop.connection.user=scott -drop

2.9 More About External Tables Generated by the ExternalTable Tool

Because external tables are used to access data, all of the features and limitations of external tables apply. Queries are executed in parallel with automatic load balancing. However, update, insert, and delete operations are not allowed and indexes cannot be created on external tables. When an external table is accessed, a full table scan is always performed.

Oracle SQL Connector for HDFS uses the ORACLE_LOADER access driver. The hdfs_stream preprocessor script (provided with Oracle SQL Connector for HDFS) modifies the input data to a format that ORACLE_LOADER can process.

2.9.1 About Configurable Column Mappings

Oracle SQL Connector for HDFS uses default data type mappings to create columns in an Oracle external table with the appropriate data types for the Hive and text sources. You can override these defaults by setting various configuration properties, for either all columns or a specific column.

For example, a field in a text file might contain a timestamp. By default, the field is mapped to a VARCHAR2 column. However, you can specify a TIMESTAMP column and provide a datetime mask to cast the values correctly into the TIMESTAMP data type. The TIMESTAMP data type supports time-based queries and analysis that are unavailable when the data is presented as text.
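
For example, Example 2-1 maps the TIME_ID text field to a TIMESTAMP column and supplies a mask with these properties:

oracle.hadoop.exttab.colMap.TIME_ID.columnType=TIMESTAMP
oracle.hadoop.exttab.colMap.timestampMask=YYYY-MM-DD:HH:MI:SS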

2.9.1.1 Default Column Mappings

Text sources are mapped to VARCHAR2 columns, and Hive columns are mapped to columns with the closest equivalent Oracle data type. Table 2-1 shows the default mappings.

2.9.1.2 All Column Overrides

The following properties apply to all columns in the external table. For Hive sources, these property settings override the oracle.hadoop.exttab.hive.* property settings.

2.9.1.4 Mapping Override Examples

The following properties create an external table in which all columns are the default VARCHAR2 data type:

oracle.hadoop.exttab.tableName=MOVIE_FACT_EXT_TAB_TXT
oracle.hadoop.exttab.columnNames=CUST_ID,MOVIE_ID,GENRE_ID,TIME_ID,RECOMMENDED,ACTIVITY_ID,RATING,SALES

In this example, the following properties are set to override the data type of several columns:

oracle.hadoop.exttab.colMap.TIME_ID.columnType=TIMESTAMP
oracle.hadoop.exttab.colMap.RECOMMENDED.columnType=NUMBER
oracle.hadoop.exttab.colMap.ACTIVITY_ID.columnType=NUMBER
oracle.hadoop.exttab.colMap.RATING.columnType=NUMBER
oracle.hadoop.exttab.colMap.SALES.columnType=NUMBER

Oracle SQL Connector for HDFS creates an external table with the specified data types:

SQL> DESCRIBE movie_facts_ext
Name                                      Null?    Type
----------------------------------------- -------- ----------------------------
CUST_ID                                            VARCHAR2(4000)
MOVIE_ID                                           VARCHAR2(4000)
GENRE_ID                                           VARCHAR2(4000)
TIME_ID                                            TIMESTAMP(9)
RECOMMENDED                                        NUMBER
ACTIVITY_ID                                        NUMBER
RATING                                             NUMBER
SALES                                              NUMBER

The next example adds the following property settings to change the length of the VARCHAR2 columns:

oracle.hadoop.exttab.colMap.CUST_ID.columnLength=12
oracle.hadoop.exttab.colMap.MOVIE_ID.columnLength=12
oracle.hadoop.exttab.colMap.GENRE_ID.columnLength=12

All columns now have custom data types:

SQL> DESCRIBE movie_facts_ext
Name                                      Null?    Type
----------------------------------------- -------- ----------------------------
CUST_ID                                            VARCHAR2(12)
MOVIE_ID                                           VARCHAR2(12)
GENRE_ID                                           VARCHAR2(12)
TIME_ID                                            TIMESTAMP(9)
RECOMMENDED                                        NUMBER
ACTIVITY_ID                                        NUMBER
RATING                                             NUMBER
SALES                                              NUMBER

2.9.2 What Are Location Files?

A location file is a file specified in the location clause of the external table. Oracle SQL Connector for HDFS creates location files that contain only the Universal Resource Identifiers (URIs) of the data files. A data file contains the data stored in HDFS.

2.9.3 Enabling Parallel Processing

To enable parallel processing with external tables, you must specify multiple files in the location clause of the external table. The number of files determines the number of child processes started by the external table during a table read, which is known as the degree of parallelism or DOP.

2.9.3.1 Setting Up the Degree of Parallelism

Ideally, you choose a degree of parallelism in advance and create a number of location files that is a multiple of that DOP, as described in the following procedure.

To set up parallel processing for maximum performance: 

  1. Identify the maximum DOP that your Oracle DBA will permit you to use when running Oracle SQL Connector for HDFS.

    When loading a huge amount of data into an Oracle database, you should also work with the DBA to identify a time when the maximum resources are available.

  2. Create a number of location files that is a small multiple of the DOP. For example, if the DOP is 8, then you might create 8, 16, 24, or 32 location files.

  3. Create a number of HDFS files that are about the same size and a multiple of the number of location files. For example, if you have 32 location files, then you might create 128, 1280, or more HDFS files, depending on the amount of data and the minimum HDFS file size.

  4. Set the DOP for the data load, using either the ALTER SESSION command or hints in the SQL SELECT statement.

    This example sets the DOP to 8 using ALTER SESSION:

    ALTER SESSION FORCE PARALLEL DML PARALLEL 8;
    ALTER SESSION FORCE PARALLEL QUERY PARALLEL 8;
    

    The next example sets the DOP to 8 using the PARALLEL hint:

    INSERT /*+ parallel(my_db_table,8) */ INTO my_db_table
       SELECT /*+ parallel(my_hdfs_external_table,8) */ *
       FROM my_hdfs_external_table;
    

    An APPEND hint in the SQL INSERT statement can also help improve performance.
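
For example, the APPEND hint can be combined with the parallel hints shown above (the table names are the same illustrative names):

    INSERT /*+ APPEND parallel(my_db_table,8) */ INTO my_db_table
       SELECT /*+ parallel(my_hdfs_external_table,8) */ *
       FROM my_hdfs_external_table;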

2.9.4 Location File Management

The Oracle SQL Connector for HDFS command-line tool, ExternalTable, creates an external table and publishes the HDFS URI information to location files. The external table location files are stored in the directory specified by the oracle.hadoop.exttab.defaultDirectory property. For an Oracle RAC database, this directory must reside on a distributed file system that is accessible to each database server.

ExternalTable manages the location files of the external table, which involves the following operations:

  • Generating new location files in the database directory after checking for name conflicts

  • Deleting existing location files in the database directory as necessary

  • Publishing data URIs to new location files

  • Altering the LOCATION clause of the external table to match the new location files

Location file management for the supported data sources is described in the following topics.

Data Pump File Format

The ORACLE_LOADER access driver is required to access Data Pump files. The driver requires that each location file corresponds to a single Data Pump file in HDFS. Empty location files are not allowed, and so the number of location files in the external table must exactly match the number of data files in HDFS.

Oracle SQL Connector for HDFS automatically takes over location file management and ensures that the number of location files in the external table equals the number of Data Pump files in HDFS.

Delimited Files in HDFS and Hive Tables

The ORACLE_LOADER access driver has no limitation on the number of location files. Each location file can correspond to one or more data files in HDFS. The number of location files for the external table is suggested by the oracle.hadoop.exttab.locationFileCount configuration property.

See "Connection Properties".

2.9.5 Location File Names

This is the format of a location file name:

osch-timestamp-number-n

In this syntax:

  • timestamp has the format yyyyMMddhhmmss, for example, 20121017103941 for October 17, 2012, at 10:39:41.

  • number is a random number used to prevent location file name conflicts among different tables.

  • n is an index used to prevent name conflicts between location files for the same table.

For example, osch-20121017103941-6807-1.

2.10 Configuring Oracle SQL Connector for HDFS

You can pass configuration properties to the ExternalTable tool on the command line with the -D option, or you can create a configuration file and pass it on the command line with the -conf option. These options must precede the command to be executed.

For example, this command uses a configuration file named example.xml:

hadoop jar OSCH_HOME/jlib/orahdfs.jar \
       oracle.hadoop.exttab.ExternalTable \
       -conf /home/oracle/example.xml \
       -createTable

See "ExternalTable Command-Line Tool Syntax".

2.10.1 Creating a Configuration File

A configuration file is an XML document with a very simple structure as follows:

<?xml version="1.0"?>
<configuration>
  <property>
    <name>property</name>
    <value>value</value>
  </property>
     .
     .
     .
</configuration>

Example 2-12 shows a configuration file. See "Oracle SQL Connector for HDFS Configuration Property Reference" for descriptions of these properties.

Example 2-12 Configuration File for Oracle SQL Connector for HDFS

<?xml version="1.0"?>
<configuration>
  <property>
    <name>oracle.hadoop.exttab.tableName</name>
    <value>SH.SALES_EXT_DIR</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.dataPaths</name>
    <value>/data/s1/*.csv,/data/s2/*.csv</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.dataCompressionCodec</name>
    <value>org.apache.hadoop.io.compress.DefaultCodec</value>
  </property>
  <property>
    <name>oracle.hadoop.connection.url</name>
    <value>jdbc:oracle:thin:@//myhost:1521/myservicename</value>
  </property>
  <property>
    <name>oracle.hadoop.connection.user</name>
    <value>SH</value>
  </property>
</configuration>

2.10.2 Oracle SQL Connector for HDFS Configuration Property Reference

The following is a complete list of the configuration properties used by the ExternalTable command-line tool. The properties are organized into these categories:

General Properties 

oracle.hadoop.exttab.colMap.columnLength

Specifies the length of all external table columns of type CHAR, VARCHAR2, NCHAR, NVARCHAR2, and RAW. Optional.

Default Value: The maximum length allowed by the column type

For Oracle Database 12c, Oracle SQL Connector for HDFS sets the length of VARCHAR2, NVARCHAR2, and RAW columns depending on whether the database MAX_STRING_SIZE option is set to STANDARD or EXTENDED.

Valid values: Integer

oracle.hadoop.exttab.colMap.columnType

Specifies the data type mapping of all columns for Hive and text sources. Optional.

You can override this setting for specific columns by setting oracle.hadoop.exttab.colMap.column_name.columnType.

Default value: VARCHAR2 for text; see Table 2-1 for Hive

Valid values: The following Oracle data types are supported:


VARCHAR2
NVARCHAR2
CHAR
NCHAR
CLOB
NCLOB
NUMBER
INTEGER
FLOAT
BINARY_DOUBLE
BINARY_FLOAT
RAW*
DATE
TIMESTAMP
TIMESTAMP WITH TIME ZONE
TIMESTAMP WITH LOCAL TIME ZONE
INTERVAL DAY TO SECOND
INTERVAL YEAR TO MONTH
* RAW binary data in delimited text files must be encoded in hexadecimal.
oracle.hadoop.exttab.colMap.dateMask

Specifies the format mask used in the date_format_spec clause of the external table for all DATE columns. This clause indicates that a character data field contains a date in the specified format.

Default value: The default globalization format mask, which is set by the NLS_DATE_FORMAT database parameter

Valid values: A datetime format model as described in Oracle Database SQL Language Reference. However, it cannot contain quotation marks.
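
For example, if the DATE fields in the text files appear as 2014-04-18, you might set the following (the value is illustrative):

oracle.hadoop.exttab.colMap.dateMask=YYYY-MM-DD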

oracle.hadoop.exttab.colMap.fieldLength

Sets the character buffer length used by the ORACLE_LOADER access driver for all CLOB columns. The value is used in the field_list clause of the external table definition, which identifies the fields in the data file and their data types.

Default value: 4000 bytes

Valid values: Integer

oracle.hadoop.exttab.colMap.timestampMask

Specifies the format mask used in the date_format_spec clause of the external table for all TIMESTAMP and TIMESTAMP WITH LOCAL TIME ZONE columns. This clause indicates that a character data field contains a timestamp in the specified format.

Default value: The default globalization format mask, which is set by the NLS_TIMESTAMP_FORMAT database parameter

Valid values: A datetime format model as described in Oracle Database SQL Language Reference. However, it cannot contain quotation marks.

oracle.hadoop.exttab.colMap.timestampTZMask

Specifies the format mask used in the date_format_spec clause of the external table for all TIMESTAMP WITH TIME ZONE columns. This clause indicates that a character data field contains a timestamp in the specified format.

Default value: The default globalization format mask, which is set by the NLS_TIMESTAMP_TZ_FORMAT database parameter

Valid values: A datetime format model as described in the Oracle Database SQL Language Reference. However, it cannot contain quotation marks.
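
For example, if the source text stores dates as 31-12-2014 and timestamps with fractional seconds, the mask properties might be set as follows. The masks shown are only illustrative; use the format models that match your data, and note that neither mask contains quotation marks:

<property>
  <name>oracle.hadoop.exttab.colMap.dateMask</name>
  <value>DD-MM-YYYY</value>
</property>
<property>
  <name>oracle.hadoop.exttab.colMap.timestampMask</name>
  <value>YYYY-MM-DD HH24:MI:SS.FF</value>
</property>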

oracle.hadoop.exttab.colMap.column_name.columnLength

Overrides the column length for column_name. Applies to external table columns of type CHAR, VARCHAR2, NCHAR, NVARCHAR2, and RAW. Optional.

Default value: The value of oracle.hadoop.exttab.colMap.columnLength; if that property is not set, then the maximum length allowed by the data type

Valid values: Integer

oracle.hadoop.exttab.colMap.column_name.columnType

Overrides the data type mapping for column_name. Optional.

The column_name is case-sensitive. It must exactly match the name of a column in a Hive table or a column listed in oracle.hadoop.exttab.columnNames.

Default value: The value of oracle.hadoop.exttab.colMap.columnType; if that property is not set, then the default data type identified in Table 2-1

Valid values: See oracle.hadoop.exttab.colMap.columnType
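
For example, a configuration for a text source might map all columns to VARCHAR2 but load one column as a NUMBER. The column name SALE_AMT below is hypothetical; it must exactly match a name listed in oracle.hadoop.exttab.columnNames:

<property>
  <name>oracle.hadoop.exttab.colMap.columnType</name>
  <value>VARCHAR2</value>
</property>
<property>
  <name>oracle.hadoop.exttab.colMap.SALE_AMT.columnType</name>
  <value>NUMBER</value>
</property>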

oracle.hadoop.exttab.colMap.column_name.dateMask

Overrides the format mask for column_name. Optional.

The column_name is case-sensitive. It must exactly match the name of a column in a Hive table or a column listed in oracle.hadoop.exttab.columnNames.

Default value: The value of oracle.hadoop.exttab.colMap.dateMask.

Valid values: A datetime format model as described in the Oracle Database SQL Language Reference. However, it cannot contain quotation marks.

oracle.hadoop.exttab.colMap.column_name.fieldLength

Overrides the character buffer length used by the ORACLE_LOADER access driver for column_name. This property is especially useful for CLOB and extended data type columns. Optional.

The column_name is case-sensitive. It must exactly match the name of a column in a Hive table or a column listed in oracle.hadoop.exttab.columnNames.

Default value: Oracle SQL Connector for HDFS sets the default field lengths as shown in Table 2-4.

Table 2-4 Field Length Calculations

Data Type of Target Column          Field Length
VARCHAR2, NVARCHAR2, CHAR, NCHAR    Value of oracle.hadoop.exttab.colMap.column_name.columnLength
RAW                                 2 * columnLength property
CLOB, NCLOB                         Value of oracle.hadoop.exttab.colMap.fieldLength
All other types                     255 (default size for external tables)


Valid values: Integer

oracle.hadoop.exttab.colMap.column_name.timestampMask

Overrides the format mask for column_name. Optional.

The column_name is case-sensitive. It must exactly match the name of a column in a Hive table or a column listed in oracle.hadoop.exttab.columnNames.

Default value: The value of oracle.hadoop.exttab.colMap.timestampMask.

Valid values: A datetime format model as described in the Oracle Database SQL Language Reference. However, it cannot contain quotation marks.

oracle.hadoop.exttab.colMap.column_name.timestampTZMask

Overrides the format mask for column_name. Optional.

The column_name is case-sensitive. It must exactly match the name of a column in a Hive table or a column listed in oracle.hadoop.exttab.columnNames.

Default value: The value of oracle.hadoop.exttab.colMap.timestampTZMask.

Valid values: A datetime format model as described in Oracle Database SQL Language Reference. However, it cannot contain quotation marks.

oracle.hadoop.exttab.columnCount

The number of columns for the external table created from delimited text files. The column names are set to C1, C2,... Cn, where n is the value of this property.

This property is ignored if oracle.hadoop.exttab.columnNames is set.

The -createTable command uses this property when oracle.hadoop.exttab.sourceType=text.

You must set either this property or oracle.hadoop.exttab.columnNames when creating an external table from delimited text files.

oracle.hadoop.exttab.columnNames

A comma-separated list of column names for an external table created from delimited text files. If this property is not set, then the column names are set to C1, C2,... Cn, where n is the value of the oracle.hadoop.exttab.columnCount property.

The column names are read as SQL identifiers: unquoted names are changed to uppercase, and double-quoted names retain their exact spelling and case.

The -createTable command uses this property when oracle.hadoop.exttab.sourceType=text.

You must set either this property or oracle.hadoop.exttab.columnCount when creating an external table from delimited text files.
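
For example, the following entry names three columns for a delimited text source (the names are illustrative). Without it, you could instead set oracle.hadoop.exttab.columnCount to 3 and accept the generated names C1, C2, and C3:

<property>
  <name>oracle.hadoop.exttab.columnNames</name>
  <value>CUST_ID,ORDER_DATE,SALE_AMT</value>
</property>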

oracle.hadoop.exttab.dataCompressionCodec

The name of the compression codec class used to decompress the data files. Specify this property when the data files are compressed. Optional.

This property specifies the class name of any compression codec that implements the org.apache.hadoop.io.compress.CompressionCodec interface. This codec applies to all data files.

Several standard codecs are available in Hadoop, including the following:

  • bzip2: org.apache.hadoop.io.compress.BZip2Codec

  • gzip: org.apache.hadoop.io.compress.GzipCodec

To use codecs that may not be available on your Hadoop cluster (such as Snappy), you must first download, install, and configure them individually on your system.

Default value: None

oracle.hadoop.exttab.dataPaths

A comma-separated list of fully qualified HDFS paths. This property enables you to restrict the input by using special pattern-matching characters in the path specification. See Table 2-5. This property is required for the -createTable and -publish commands using Data Pump or delimited text files. The property is ignored for Hive data sources.

For example, to select all files in /data/s2/, and only the CSV files in /data/s7/, /data/s8/, and /data/s9/, enter this expression:

/data/s2/,/data/s[7-9]/*.csv

The external table accesses the data contained in all listed files and all files in listed directories. These files compose a single data set.

The data set can contain compressed files or uncompressed files, but not both.

Table 2-5 Pattern-Matching Characters

Character          Description
?                  Matches any single character
*                  Matches zero or more characters
[abc]              Matches a single character from the character set {a, b, c}
[a-b]              Matches a single character from the character range {a...b}. The character a must be less than or equal to b.
[^a]               Matches a single character that is not from the character set or range {a}. The caret (^) must immediately follow the left bracket.
\c                 Removes any special meaning of character c. The backslash is the escape character.
{ab\,cd}           Matches a string from the string set {ab, cd}. Precede the comma with an escape character (\) to remove the meaning of the comma as a path separator.
{ab\,c{de\,fh}}    Matches a string from the string set {ab, cde, cfh}. Precede the comma with an escape character (\) to remove the meaning of the comma as a path separator.


oracle.hadoop.exttab.dataPathFilter

The path filter class. This property is ignored for Hive data sources.

Oracle SQL Connector for HDFS uses a default filter to exclude hidden files, which begin with a dot or an underscore. If you specify another path filter class using this property, then your filter acts in addition to the default filter; only visible files that your filter accepts are considered.

oracle.hadoop.exttab.defaultDirectory

Specifies the default directory for the Oracle external table. This directory is used for all input and output files that do not explicitly name a directory object.

Valid value: The name of an existing database directory

Unquoted names are changed to upper case. Double-quoted names are not changed; use them when case-sensitivity is desired. Single-quoted names are not allowed for default directory names.

The -createTable command requires this property.

oracle.hadoop.exttab.fieldTerminator

Specifies the field terminator for an external table when oracle.hadoop.exttab.sourceType=text. Optional.

Default value: , (comma)

Valid values: A string in one of the following formats:

  • One or more regular printable characters; it cannot start with \u. For example, \t represents a tab.

  • One or more encoded characters in the format \uHHHH, where HHHH is a big-endian hexadecimal representation of the character in UTF-16. For example, \u0009 represents a tab. The hexadecimal digits are case insensitive.

Do not mix the two formats.
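
For example, a tab-delimited source can be described with the encoded form of the terminator:

<property>
  <name>oracle.hadoop.exttab.fieldTerminator</name>
  <value>\u0009</value>
</property>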

oracle.hadoop.exttab.hive.columnType.*

Maps a Hive data type to an Oracle data type. The property name identifies the Hive data type, and its value is an Oracle data type. The target columns in the external table are created with the Oracle data type indicated by this property.

When a Hive TIMESTAMP column is mapped to an Oracle TIMESTAMP column, the format mask is YYYY-MM-DD HH24:MI:SS.FF. When a Hive STRING column is mapped to an Oracle TIMESTAMP column, the NLS parameter settings for the database are used by default. You can override these defaults by using the oracle.hadoop.exttab.colMap.timestampMask or oracle.hadoop.exttab.colMap.timestampTZMask property.

Default values: Table 2-6 lists the Hive column type properties and their default values.

Valid values: See the valid values for oracle.hadoop.exttab.colMap.columnType.

Table 2-6 Hive Column Type Mapping Properties

Property                                           Default Value
oracle.hadoop.exttab.hive.columnType.BIGINT        INTEGER
oracle.hadoop.exttab.hive.columnType.BOOLEAN       VARCHAR2
oracle.hadoop.exttab.hive.columnType.DECIMAL       NUMBER
oracle.hadoop.exttab.hive.columnType.DOUBLE        NUMBER
oracle.hadoop.exttab.hive.columnType.FLOAT         NUMBER
oracle.hadoop.exttab.hive.columnType.INT           INTEGER
oracle.hadoop.exttab.hive.columnType.SMALLINT      INTEGER
oracle.hadoop.exttab.hive.columnType.STRING        VARCHAR2
oracle.hadoop.exttab.hive.columnType.TIMESTAMP     TIMESTAMP
oracle.hadoop.exttab.hive.columnType.TINYINT       INTEGER
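
For example, to create CLOB columns instead of the default VARCHAR2 for all Hive STRING columns, you might override the mapping as follows:

<property>
  <name>oracle.hadoop.exttab.hive.columnType.STRING</name>
  <value>CLOB</value>
</property>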


oracle.hadoop.exttab.hive.databaseName

The name of a Hive database that contains the input data table.

The -createTable command requires this property when oracle.hadoop.exttab.sourceType=hive.

oracle.hadoop.exttab.hive.partitionFilter

A valid HiveQL expression that is used to filter the source Hive table partitions. This property is ignored if the table is not partitioned. For additional information, see "oracle.hadoop.loader.input.hive.partitionFilter". The two properties are identical.

Default value: None. All partitions of the Hive table are mapped to external tables.

Valid values: A valid HiveQL expression. Hive user-defined functions (UDFs) and Hive variables are not supported.
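
For example, if the source Hive table were partitioned by a country column (a hypothetical partition key), the following entry would map only the partitions for the United States:

<property>
  <name>oracle.hadoop.exttab.hive.partitionFilter</name>
  <value>country='US'</value>
</property>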

oracle.hadoop.exttab.hive.tableName

The name of an existing Hive table.

The -createTable command requires this property when oracle.hadoop.exttab.sourceType=hive.

oracle.hadoop.exttab.initialFieldEncloser

Specifies the initial field encloser for an external table created from delimited text files. Optional.

Default value: null; no enclosers are specified for the external table definition.

The -createTable command uses this property when oracle.hadoop.exttab.sourceType=text.

Valid values: A string in one of the following formats:

  • One or more regular printable characters; it cannot start with \u.

  • One or more encoded characters in the format \uHHHH, where HHHH is a big-endian hexadecimal representation of the character in UTF-16. The hexadecimal digits are case insensitive.

Do not mix the two formats.

oracle.hadoop.exttab.locationFileCount

Specifies the desired number of location files for the external table. Applicable only to non-Data-Pump files.

Default value: 4

This property is ignored if the data files are in Data Pump format. Otherwise, the number of location files is the lesser of:

  • The number of data files

  • The value of this property

At least one location file is created.
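
For example, with the default value of 4 and only two data files, two location files are created; with ten data files, four are created.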

See "Enabling Parallel Processing" for more information about the number of location files.

oracle.hadoop.exttab.logDirectory

Specifies a database directory where log files, bad files, and discard files are stored. The file names are the default values used by external tables. For example, the name of a log file is the table name followed by _%p.log.

This is an optional property for the -createTable command.

These are the default file name extensions:

  • Log files: log

  • Bad files: bad

  • Discard files: dsc

Valid values: An existing Oracle directory object name.

Unquoted names are changed to uppercase. Quoted names are not changed. Table 2-7 provides examples of how values are transformed.

Table 2-7 Examples of Quoted and Unquoted Values

Specified Value                        Interpreted Value
my_log_dir:'sales_tab_%p.log'          MY_LOG_DIR/sales_tab_%p.log
'my_log_dir':'sales_tab_%p.log'        my_log_dir/sales_tab_%p.log
"my_log_dir":"sales_tab_%p.log"        my_log_dir/sales_tab_%p.log


oracle.hadoop.exttab.preprocessorDirectory

Specifies the database directory for the preprocessor. The file-system directory must contain the hdfs_stream script.

Default value: OSCH_BIN_PATH

The preprocessor directory is used in the PREPROCESSOR clause of the external table.

oracle.hadoop.exttab.recordDelimiter

Specifies the record delimiter for an external table created from delimited text files. Optional.

Default value: \n

The -createTable command uses this property when oracle.hadoop.exttab.sourceType=text.

Valid values: A string in one of the following formats:

  • One or more regular printable characters; it cannot start with \u.

  • One or more encoded characters in the format \uHHHH, where HHHH is a big-endian hexadecimal representation of the character in UTF-16. The hexadecimal digits are case insensitive.

Do not mix the two formats.

oracle.hadoop.exttab.sourceType

Specifies the type of source files. The -createTable and -publish operations require the value of this property.

Default value: text

Valid values: datapump, hive, or text

oracle.hadoop.exttab.stringSizes

Indicates whether the lengths specified for character strings are bytes or characters. This value is used in the STRING SIZES ARE IN clause of the external table. Use CHARACTERS when loading multibyte character sets. See Oracle Database Utilities.

Default value: BYTES

Valid values: BYTES or CHARACTERS

oracle.hadoop.exttab.tableName

Schema-qualified name of the external table, or the metadata table for partitioned Hive tables, in this format:

schemaName.tableName

If you omit schemaName, then the schema name defaults to the connection user name.

Default value: none

Required property for all operations.

oracle.hadoop.exttab.trailingFieldEncloser

Specifies the trailing field encloser for an external table created from delimited text files. Optional.

Default value: null; defaults to the value of oracle.hadoop.exttab.initialFieldEncloser

The -createTable command uses this property when oracle.hadoop.exttab.sourceType=text.

Valid values: A string in one of the following formats:

  • One or more regular printable characters; it cannot start with \u.

  • One or more encoded characters in the format \uHHHH, where HHHH is a big-endian hexadecimal representation of the character in UTF-16. The hexadecimal digits are case insensitive.

Do not mix the two formats.
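
For example, fields enclosed in double quotation marks can be described with the encoded form of the character. Because oracle.hadoop.exttab.trailingFieldEncloser defaults to the initial encloser, only the initial encloser needs to be set when they match:

<property>
  <name>oracle.hadoop.exttab.initialFieldEncloser</name>
  <value>\u0022</value>
</property>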

Connection Properties 

oracle.hadoop.connection.url

Specifies the database connection string in the thin-style service name format:

jdbc:oracle:thin:@//host_name:port/service_name

If you are unsure of the service name, then enter this SQL command as a privileged user:

SQL> show parameter service

If an Oracle wallet is configured as an external password store, then the property value must start with the driver prefix jdbc:oracle:thin:@, and the connect string that follows must exactly match a credential defined in the wallet.

This property takes precedence over all other connection properties.

Default value: Not defined

Valid values: A string

oracle.hadoop.connection.user

An Oracle database log-in name. The ExternalTable tool prompts for a password. This property is required unless you are using Oracle wallet as an external password store.

Default value: Not defined

Valid values: A string

oracle.hadoop.connection.tnsEntryName

Specifies a TNS entry name defined in the tnsnames.ora file.

This property is used with the oracle.hadoop.connection.tns_admin property.

Default value: Not defined

Valid values: A string

oracle.hadoop.connection.tns_admin

Specifies the directory that contains the tnsnames.ora file. Define this property to use transparent network substrate (TNS) entry names in database connection strings. When using TNSNames with the JDBC thin driver, you must set either this property or the Java oracle.net.tns_admin property. When both are set, this property takes precedence over oracle.net.tns_admin.

This property must be set when using Oracle Wallet as an external password store. See oracle.hadoop.connection.wallet_location.

Default value: The value of the Java oracle.net.tns_admin system property

Valid values: A string

oracle.hadoop.connection.wallet_location

A file path to an Oracle wallet directory where the connection credential is stored.

Default value: Not defined

Valid values: A string

When using Oracle Wallet as an external password store, set these properties:

  • oracle.hadoop.connection.wallet_location

  • oracle.hadoop.connection.url or oracle.hadoop.connection.tnsEntryName

  • oracle.hadoop.connection.tns_admin
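
For example, a configuration that authenticates through a wallet might contain entries like the following; the TNS entry name and directory paths are placeholders for your own values:

<property>
  <name>oracle.hadoop.connection.tnsEntryName</name>
  <value>myservicename</value>
</property>
<property>
  <name>oracle.hadoop.connection.tns_admin</name>
  <value>/home/oracle/network/admin</value>
</property>
<property>
  <name>oracle.hadoop.connection.wallet_location</name>
  <value>/home/oracle/wallets/osch</value>
</property>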

2.11 Performance Tips for Querying Data in HDFS

Parallel processing is extremely important when you are working with large volumes of data. When you use external tables, always enable parallel query with this SQL command:

ALTER SESSION ENABLE PARALLEL QUERY;

Before loading the data into an Oracle database from the external files created by Oracle SQL Connector for HDFS, enable parallel DDL:

ALTER SESSION ENABLE PARALLEL DDL;

Before inserting data into an existing database table, enable parallel DML with this SQL command:

ALTER SESSION ENABLE PARALLEL DML;

Hints such as APPEND and PQ_DISTRIBUTE also improve performance when you are inserting data.
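
For example, a bulk load from an external table created by Oracle SQL Connector for HDFS might combine these settings with the APPEND hint; the table names here are placeholders:

ALTER SESSION ENABLE PARALLEL DML;

INSERT /*+ APPEND */ INTO sales_copy
  SELECT * FROM sales_hdfs_ext_tab;

COMMIT;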