Oracle® Big Data Connectors User's Guide
Release 1 (1.0)

Part Number E27365-06

2 Oracle Direct Connector for Hadoop Distributed File System

This chapter describes how to use Oracle Direct Connector for Hadoop Distributed File System (Oracle Direct Connector) to facilitate data access between a Hadoop Distributed File System (HDFS) and Oracle Database.

This chapter contains the following topics:

  • About Oracle Direct Connector

  • Creating an External Table for HDFS

  • Publishing the HDFS Data Paths

  • Querying Data in HDFS

2.1 About Oracle Direct Connector

Oracle Direct Connector runs on the system where Oracle Database runs. It provides read access to HDFS from Oracle Database by using external tables.

An external table is an Oracle Database object that identifies the location of data outside of the database. Oracle Database accesses the data by using the metadata provided when the external table was created. By querying the external tables, users can access data stored in HDFS as if that data were stored in tables in the database. External tables are often used to stage data to be transformed during a database load.

You can use Oracle Direct Connector to query data stored in HDFS from Oracle Database, including delimited text (CSV) files and Data Pump files produced by Oracle Loader for Hadoop, and to load that data into database tables.

Oracle Direct Connector uses the ORACLE_LOADER access driver.

Note:

Oracle Direct Connector requires a database patch before it can access Data Pump files produced by Oracle Loader for Hadoop. To download this patch, go to http://support.oracle.com and search for bug 13079417.

2.2 Creating an External Table for HDFS

You create an external table for HDFS in the same way as any other external table, except that you must specify the following PREPROCESSOR clause in the SQL CREATE TABLE statement:

PREPROCESSOR HDFS_BIN_PATH:hdfs_stream

HDFS_BIN_PATH is the name of the Oracle directory object that points to the bin subdirectory where Oracle Direct Connector is installed. See "Installing Oracle Direct Connector".

To access Data Pump files, you must also specify this access parameter:

EXTERNAL VARIABLE DATA
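
For illustration, here is a minimal sketch of where both clauses appear when the HDFS content is a Data Pump file. The table, column, directory, and location file names are hypothetical, and the column list must match the contents of your Data Pump file:

CREATE TABLE order_dp_ext_tab
    (  order_id     NUMBER,
       order_total  NUMBER(10,2)
    )
    ORGANIZATION EXTERNAL
    (  TYPE ORACLE_LOADER
       DEFAULT DIRECTORY order_ext_dir
       ACCESS PARAMETERS
       (  EXTERNAL VARIABLE DATA
          PREPROCESSOR HDFS_BIN_PATH:hdfs_stream
       )
       LOCATION ('orders1')
    );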

2.2.1 Basic SQL Syntax for the External Table

Following is the basic SQL syntax for creating an external table for HDFS:

CREATE TABLE [schema.]table
     (   column datatype, ...
     )
     ORGANIZATION EXTERNAL
     (   
         TYPE ORACLE_LOADER
         DEFAULT DIRECTORY directory
         ACCESS PARAMETERS  
        (   PREPROCESSOR HDFS_BIN_PATH:hdfs_stream
            access_parameters...
        )  
     LOCATION (file1,file2...)
     );
schema.table

Name of the external table to be created.

column

Name of the columns in the table.

datatype

Data type of the column, which is limited to those supported by ORACLE_LOADER. The access driver performs some data type conversions automatically.

directory

Name of the default directory object used by the external table for all input and output files, such as the location files, log files, and bad record files. Do not use the directory where the data is stored for these files.

access_parameters

Any additional subclauses in the ORACLE_LOADER access_parameters clause, such as record and field formatting.

file1, file2...

The names of the location files that identify the paths to the data in HDFS. For CSV content in HDFS, specify two or more location file names, because the degree of parallelism is limited by the number of location files. For Data Pump content in HDFS, specify one location file for each Data Pump file.

Oracle Direct Connector creates these files in the default directory. If the files already exist, they are overwritten.

For other types of external tables, the location files contain the data itself. With Oracle Direct Connector, however, the data remains in HDFS; the location files hold only references to it.

2.2.2 Testing the External Table

After creating the external table, query it to verify that the preprocessor script is configured correctly:

SELECT count(*) FROM external_table;

If the query returns a count of zero and no errors, then you can continue.

2.2.3 External Table Example

Example 2-1 creates an external table named SALES_HDFS_EXT_TAB in the SCOTT schema. The SALES_HDFS_EXT_TAB external table is created in a database directory named SALES_EXT_DIR. SCOTT must have read and write privileges on this directory.

To create the SALES_EXT_DIR database directory: 

  1. Create the file system directory:

    $ mkdir /scratch/sales_ext_dir
    $ chmod 775 /scratch/sales_ext_dir
    
  2. Open a SQL command interface:

    $ sqlplus / as sysdba
    
  3. Create the database directory:

    SQL> CREATE OR REPLACE DIRECTORY sales_ext_dir AS '/scratch/sales_ext_dir';
    
  4. Grant read and write access to SCOTT:

    SQL> GRANT READ, WRITE ON DIRECTORY sales_ext_dir TO scott; 
    

Example 2-1 Defining an External Table for HDFS

CREATE TABLE "SCOTT"."SALES_HDFS_EXT_TAB"  
    (  "PROD_ID"          NUMBER(6),
       "CUST_ID"          NUMBER,
       "TIME_ID"          DATE,
       "CHANNEL_ID"       CHAR(1),
       "PROMO_ID"         NUMBER(6), 
       "QUANTITY_SOLD"    NUMBER(3),
       "AMOUNT_SOLD"      NUMBER(10,2) 
    )  
    ORGANIZATION EXTERNAL 
    ( TYPE ORACLE_LOADER   
      DEFAULT DIRECTORY  "SALES_EXT_DIR"   
      ACCESS PARAMETERS  
      (   RECORDS DELIMITED BY NEWLINE
          FIELDS TERMINATED BY ',' 
            (
            "PROD_ID" DECIMAL EXTERNAL,
               .
               .
               .
            "TIME_ID" CHAR DATE_FORMAT TIMESTAMP MASK "...",
               .
               .
               .
            )   
          PREPROCESSOR HDFS_BIN_PATH:hdfs_stream    
      )  
      LOCATION ('sales1','sales2','sales3')
    );

2.3 Publishing the HDFS Data Paths

The procedure in the previous section creates only the metadata in Oracle Database. As mentioned earlier, the location files of a typical external table store the data values; in this case, however, the location files are empty. By running the Oracle Direct Connector ExternalTable command-line tool, you populate the location files with the Uniform Resource Identifiers (URIs) of the data files in HDFS. When users query the external table, the Oracle Direct Connector preprocessor uses that information to locate the data in HDFS and stream it to the database.

2.3.1 ExternalTable Command

The ExternalTable command uses the values of several properties to populate the location files. You can specify these property values in an XML document or individually on the command line.

2.3.1.1 Altering HADOOP_CLASSPATH

Before issuing the ExternalTable command, alter the HADOOP_CLASSPATH environment variable to include these JAR files:

  • $ORACLE_HOME/jdbc/lib/ojdbc6.jar: Required.

  • $ORACLE_HOME/jlib/oraclepki.jar: Required only if you are using an Oracle wallet as an external password store.

See "ExternalTable Command Example".

2.3.1.2 ExternalTable Command Syntax

This is the syntax of the ExternalTable command:

$HADOOP_HOME/bin/hadoop jar $DIRECTHDFS_HOME/jlib/orahdfs.jar oracle.hadoop.hdfs.exttab.ExternalTable [-conf config_file | -D property=value] -publish [-noexecute]
-conf config_file

Identifies the name of an XML configuration file containing the properties needed to populate the location files. See "Creating a Configuration File".

-D property=value

Assigns a value to a specific property.

-noexecute

Generates an execution plan, but does not execute any of the actions.
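
For example, the following command generates the execution plan for the properties in a configuration file without making any changes; the file path is hypothetical:

$ $HADOOP_HOME/bin/hadoop jar $DIRECTHDFS_HOME/jlib/orahdfs.jar \
  oracle.hadoop.hdfs.exttab.ExternalTable \
  -conf /home/scott/sales_config.xml -publish -noexecute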

2.3.1.3 ExternalTable Command Example

Example 2-2 sets the HADOOP_CLASSPATH variable and publishes HDFS data paths to the external table created in Example 2-1.

Example 2-2 Publishing HDFS Data Paths to an External Table

This example uses the Bash shell.

$  export HADOOP_CLASSPATH="$ORACLE_HOME/jdbc/lib/ojdbc6.jar:$ORACLE_HOME/jlib/oraclepki.jar"
 
$ $HADOOP_HOME/bin/hadoop jar \
  $DIRECTHDFS_HOME/jlib/orahdfs.jar oracle.hadoop.hdfs.exttab.ExternalTable  \
  -D oracle.hadoop.hdfs.exttab.tableName=SALES_HDFS_EXT_TAB \
  -D oracle.hadoop.hdfs.exttab.datasetPaths=hdfs:/user/scott/data/ \
  -D oracle.hadoop.hdfs.exttab.connection.url=jdbc:oracle:thin:@//myhost:1521/orcl \
  -D oracle.hadoop.hdfs.exttab.connection.user=scott -publish
 

Where: 

  • $HADOOP_HOME is an environment variable pointing to the Hadoop home directory.

  • $DIRECTHDFS_HOME is an environment variable pointing to the Oracle Direct Connector installation directory.

  • SALES_HDFS_EXT_TAB is the external table created in Example 2-1.

  • hdfs:/user/scott/data/ is the location of the HDFS data.

  • @//myhost:1521/orcl is the database connection string.

2.3.2 Creating a Configuration File

A configuration file is an XML document with a very simple structure as follows:

<?xml version="1.0"?>
 <configuration>
    <property>
      <name>property</name>
      <value>value</value> 
    </property> 
        .
        .
        . 
</configuration>

See "Configuration Properties" for descriptions of these properties.

Example 2-3 shows a configuration file.

Example 2-3 Configuration File for Oracle Direct Connector

<?xml version="1.0"?>
 <configuration>
    <property>
      <name>oracle.hadoop.hdfs.exttab.tableName</name>
      <value>SH.SALES_EXT_TAB</value>
    </property> 
   <property> 
      <name>oracle.hadoop.hdfs.exttab.datasetPaths</name>
      <value>/data/s1/*.csv,/data/s2/*.csv</value> 
    </property>   
    <property> 
      <name>oracle.hadoop.hdfs.exttab.datasetCompressionCodec</name>                
      <value>org.apache.hadoop.io.compress.DefaultCodec</value> 
    </property>   
    <property>
      <name>oracle.hadoop.hdfs.exttab.connection.url</name>
      <value>
       jdbc:oracle:thin:@example.com:1521/example.example.com
     </value> 
    </property> 
    <property>
      <name>oracle.hadoop.hdfs.exttab.connection.user</name>
      <value>SH</value> 
    </property>      
</configuration>

2.3.3 Configuration Properties

Following are the configuration properties used by the ExternalTable command to create links to the data files in HDFS. These properties are required:

oracle.hadoop.hdfs.exttab.connection.url
oracle.hadoop.hdfs.exttab.datasetPaths
oracle.hadoop.hdfs.exttab.tableName

2.3.3.1 Property Descriptions

oracle.hadoop.hdfs.exttab.connection.url

The URL of the database connection string. This property overrides all other connection properties. The connecting database user must have the privileges described in "Granting User Access to Oracle Direct Connector". Required.

Using a Wallet

If you are using an Oracle wallet as an external password store, then the property value must have this form:

jdbc:oracle:thin:@db_connect_string

The db_connect_string must exactly match the credential in the wallet.

This example uses Oracle Net Services syntax:

jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS_LIST=
   (ADDRESS=(PROTOCOL=TCP)(HOST=myhost)(PORT=1521)))
      (CONNECT_DATA=(SERVICE_NAME=my_db_service_name)))

The next example uses a TNSNAMES entry:

jdbc:oracle:thin:@my_tns_entry

See also oracle.hadoop.hdfs.exttab.connection.wallet_location.
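
For example, if the credential was stored in the wallet with the mkstore utility (the wallet path and entry name here are hypothetical), the same connect string follows jdbc:oracle:thin:@ in the URL. mkstore prompts for the database password if it is not supplied on the command line:

$ mkstore -wrl /home/oracle/wallets/hdfs_wallet -createCredential my_tns_entry scott

jdbc:oracle:thin:@my_tns_entry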

Not Using a Wallet

If you are not using an Oracle wallet, then use one of the following URL connection styles.

  • Thin Connection:

    jdbc:oracle:thin:@//myhost:1521/my_db_service_name
    
  • Oracle Net Services:

    jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS_LIST=
       (ADDRESS=(PROTOCOL=TCP)(HOST=myhost)(PORT=1521)))
          (CONNECT_DATA=(SERVICE_NAME=my_db_service_name)))
    
  • TNS Entry Name:

    jdbc:oracle:thin:@myTNSEntryName
    

oracle.hadoop.hdfs.exttab.connection.user

An Oracle Database user name. This property is required when you are not using an Oracle wallet as an external password store.

oracle.hadoop.hdfs.exttab.connection.tns_admin

File path to a directory containing SQL*Net configuration files, such as sqlnet.ora and tnsnames.ora. The value of the TNS_ADMIN environment variable is used for this property by default.

Define this property to use TNS entry names in database connect strings.

oracle.hadoop.hdfs.exttab.connection.tnsEntryName

A TNS entry name defined in the tnsnames.ora file. This property is used with oracle.hadoop.hdfs.exttab.connection.tns_admin.
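
For example, a configuration file might define the two properties together as follows; the directory path and entry name are hypothetical:

<property>
  <name>oracle.hadoop.hdfs.exttab.connection.tns_admin</name>
  <value>/u01/app/oracle/product/11.2.0/dbhome_1/network/admin</value>
</property>
<property>
  <name>oracle.hadoop.hdfs.exttab.connection.tnsEntryName</name>
  <value>my_tns_entry</value>
</property>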

oracle.hadoop.hdfs.exttab.datasetCompressionCodec

The class name of the compression codec that implements the org.apache.hadoop.io.compress.CompressionCodec interface. The Decompressor class from the codec is used by the preprocessor script to decompress data for the external table. This codec applies to the entire data set.

Specify this property if the data set contains compressed files.

oracle.hadoop.hdfs.exttab.datasetPathFilter

The class name of a path filter that implements the org.apache.hadoop.fs.PathFilter interface. The paths in the data set are selected only if this filter class accepts them.

oracle.hadoop.hdfs.exttab.datasetPaths

A comma-separated list of fully qualified HDFS paths. This parameter enables you to restrict the input by using special pattern-matching characters in the path specification. See Table 2-1. Required.

For example, to select all files in /data/s2/, and only the CSV files in /data/s7/, /data/s8/, and /data/s9/, enter this expression:

/data/s2/,/data/s[7-9]/*.csv

The external table accesses the data contained in all listed files and all files in listed directories. These files compose a single data set.

The data set can contain compressed files or uncompressed files, but not both.

Table 2-1 Pattern Matching Characters

Character Description

?

Matches any single character

*

Matches zero or more characters

[abc]

Matches a single character from the character set {a, b, c}.

[a-b]

Matches a single character from the character range {a...b}. The character a must be less than or equal to b.

[^a]

Matches a single character that is not in the character set or range {a}. The caret (^) must immediately follow the left bracket.

\c

Removes any special meaning of character c. The backslash is the escape character.

{ab\,cd}

Matches a string from the string set {ab, cd}. Precede the comma with an escape character (\) to remove the meaning of the comma as a path separator.

{ab\,c{de\,fh}}

Matches a string from the string set {ab, cde, cfh}. Precede the comma with an escape character (\) to remove the meaning of the comma as a path separator.


oracle.hadoop.hdfs.exttab.connection.wallet_location

File path to an Oracle wallet where the connection credential is stored. When using an Oracle wallet as an external password store, set this property together with the appropriate connection property:

For a URL connection: oracle.hadoop.hdfs.exttab.connection.url

For a TNS entry name: oracle.hadoop.hdfs.exttab.connection.tnsEntryName and oracle.hadoop.hdfs.exttab.connection.tns_admin

oracle.hadoop.hdfs.exttab.tableName

Schema-qualified name of the external table in the format

schemaName.tableName

See "Creating an External Table for HDFS" about creating an external table for Oracle Direct Connector. Required.

2.4 Querying Data in HDFS

Parallel processing is extremely important when working with large volumes of data. When using external tables, always enable parallel query with this SQL command:

ALTER SESSION ENABLE PARALLEL QUERY;

Before loading data into Oracle Database from the external files created by Oracle Direct Connector, enable parallel DDL:

ALTER SESSION ENABLE PARALLEL DDL;

Before inserting data into an existing database table, enable parallel DML with this SQL command:

ALTER SESSION ENABLE PARALLEL DML;

Hints such as APPEND and PQ_DISTRIBUTE also improve performance when inserting data.
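
The following is a minimal sketch that combines these settings to load rows from the external table in Example 2-1 into an existing SALES table. The target table is hypothetical and must have columns compatible with the external table:

ALTER SESSION ENABLE PARALLEL QUERY;
ALTER SESSION ENABLE PARALLEL DML;

-- APPEND requests a direct-path insert; the data is streamed from HDFS
-- through the external table and written directly into SALES.
INSERT /*+ APPEND */ INTO sales
   SELECT * FROM scott.sales_hdfs_ext_tab;

COMMIT;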