Oracle® Big Data Connectors User's Guide
Release 1 (1.1)

Part Number E36049-03

2 Oracle Direct Connector for Hadoop Distributed File System

This chapter describes how to use Oracle Direct Connector for Hadoop Distributed File System (Oracle Direct Connector for HDFS) to facilitate data access between Hadoop Distributed File System (HDFS) and Oracle Database.

This chapter contains the following sections:

  • About Oracle Direct Connector for HDFS

  • Creating an External Table for HDFS

  • Publishing the HDFS Data Paths

  • Querying Data in HDFS

2.1 About Oracle Direct Connector for HDFS

Oracle Direct Connector for HDFS runs on the system where Oracle Database runs. It provides read access to HDFS from Oracle Database by using external tables.

An external table is an Oracle Database object that identifies the location of data outside of a database. Oracle Database accesses the data by using the metadata provided when the external table was created. By querying the external tables, users can access data stored in HDFS as if that data were stored in tables in a database. External tables are often used to stage data to be transformed during a database load.

These are a few ways that you can use Oracle Direct Connector for HDFS:

  • Query data stored in HDFS from Oracle Database, leaving the data in place in HDFS.

  • Query CSV files or Data Pump files (such as those produced by Oracle Loader for Hadoop) stored in HDFS.

  • Use the external table as a staging area to load data from HDFS into an Oracle database.

Oracle Direct Connector for HDFS uses the ORACLE_LOADER access driver.

Note:

Oracle Direct Connector for HDFS requires a patch to Oracle Database before the connector can access Data Pump files produced by Oracle Loader for Hadoop. To download this patch, go to http://support.oracle.com and search for bug 14557588.

2.2 Creating an External Table for HDFS

You create an external table for HDFS the same way as any other external table, except that you must specify the following PREPROCESSOR clause in the ACCESS PARAMETERS of the SQL CREATE TABLE statement:

PREPROCESSOR HDFS_BIN_PATH:hdfs_stream

HDFS_BIN_PATH is the name of the database directory that points to the bin subdirectory where Oracle Direct Connector for HDFS is installed. See "Installing Oracle Direct Connector for HDFS".
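If the HDFS_BIN_PATH database directory was not already created during installation, it can be created as in the following sketch. The installation path shown is only an assumption; substitute the bin subdirectory of your Oracle Direct Connector for HDFS installation, and grant access to the users who query the external table (SCOTT in this chapter's examples):

SQL> CREATE OR REPLACE DIRECTORY hdfs_bin_path AS '/opt/orahdfs-1.1/bin';
SQL> GRANT READ, EXECUTE ON DIRECTORY hdfs_bin_path TO scott;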

To access Data Pump files, you must also specify this access parameter:

EXTERNAL VARIABLE DATA
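The following is a minimal sketch of an external table for Data Pump files in HDFS. The table name, columns, and location file name are assumptions for illustration only; SALES_EXT_DIR is the database directory created in Example 2-1:

CREATE TABLE "SCOTT"."SALES_DP_EXT_TAB"
    (  "PROD_ID"          NUMBER(6),
       "CUST_ID"          NUMBER,
       "AMOUNT_SOLD"      NUMBER(10,2)
    )
    ORGANIZATION EXTERNAL
    ( TYPE ORACLE_LOADER
      DEFAULT DIRECTORY "SALES_EXT_DIR"
      ACCESS PARAMETERS
      (   EXTERNAL VARIABLE DATA
          PREPROCESSOR HDFS_BIN_PATH:hdfs_stream
      )
      LOCATION ('sales_dp1')
    );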

2.2.1 Basic SQL Syntax for the External Table

Following is the basic SQL syntax for creating an external table for HDFS:

CREATE TABLE [schema.]table
     (   column datatype, ...
     )
     ORGANIZATION EXTERNAL
     (   
         TYPE ORACLE_LOADER
         DEFAULT DIRECTORY directory
         ACCESS PARAMETERS  
        (   PREPROCESSOR HDFS_BIN_PATH:hdfs_stream
            access_parameters...
        )  
     LOCATION (file1,file2...)
     );
schema.table

Name of the external table to be created.

column

Name of the columns in the table.

datatype

Data type of the column, which is limited to the data types supported by ORACLE_LOADER. Oracle Database performs some data type conversions automatically.

directory

Name of the default directory object used by the external table for all input and output files, such as the location files, log files, and bad record files. Do not use the directory where the data is stored for these files.

access_parameters

Any additional subclauses in the ORACLE_LOADER access_parameters clause, such as record and field formatting.

file1, file2...

The names of the location files that identify the paths to the data in HDFS. For CSV content in HDFS, specify two or more location file names, because the degree of parallelism is limited by the number of location files. For Data Pump content in HDFS, specify one location file for each Data Pump file.

Oracle Direct Connector for HDFS creates these files in the default directory. If the files already exist, they are overwritten.

For other types of external tables, the location files contain the data, but Oracle Direct Connector for HDFS retains the data in HDFS.

2.2.2 Testing the External Table

After creating the external table, query it to verify that the preprocessor script is configured correctly:

SELECT count(*) FROM external_table;

If the query completes without errors, then the preprocessor script is configured correctly and you can continue.

2.2.3 External Table Example

Example 2-1 creates an external table named SALES_HDFS_EXT_TAB in the SCOTT schema. The SALES_HDFS_EXT_TAB external table is created in a database directory named SALES_EXT_DIR. SCOTT must have read and write privileges on this directory.

To create the SALES_EXT_DIR database directory: 

  1. Create the file system directory:

    $ mkdir /scratch/sales_ext_dir
    $ chmod 775 /scratch/sales_ext_dir
    
  2. Open a SQL command interface:

    $ sqlplus / as sysdba
    
  3. Create the database directory:

    SQL> CREATE OR REPLACE DIRECTORY sales_ext_dir AS '/scratch/sales_ext_dir';
    
  4. Grant read and write access to SCOTT:

    SQL> GRANT READ, WRITE ON DIRECTORY sales_ext_dir TO scott; 
    

Example 2-1 Defining an External Table for HDFS

CREATE TABLE "SCOTT"."SALES_HDFS_EXT_TAB"  
    (  "PROD_ID"          NUMBER(6),
       "CUST_ID"          NUMBER,
       "TIME_ID"          DATE,
       "CHANNEL_ID"       CHAR(1),
       "PROMO_ID"         NUMBER(6), 
       "QUANTITY_SOLD"    NUMBER(3),
       "AMOUNT_SOLD"      NUMBER(10,2) 
    )  
    ORGANIZATION EXTERNAL 
    ( TYPE ORACLE_LOADER   
      DEFAULT DIRECTORY  "SALES_EXT_DIR"   
      ACCESS PARAMETERS  
      (   RECORDS DELIMITED BY NEWLINE
          PREPROCESSOR HDFS_BIN_PATH:hdfs_stream    
          FIELDS TERMINATED BY ',' 
            (
            "PROD_ID" DECIMAL EXTERNAL,
               .
               .
               .
            "TIME_ID" CHAR DATE_FORMAT TIMESTAMP MASK "...",
               .
               .
               .
            )   
      )  
      LOCATION ( 'sales1','sales2','sales3')
    );

2.3 Publishing the HDFS Data Paths

The previous procedure for creating an external table creates only the metadata in Oracle Database. As mentioned earlier, the location files typically store the data values; in this case, however, the location files are initially empty. By executing the Oracle Direct Connector for HDFS ExternalTable command-line tool, you populate the location files with the Uniform Resource Identifiers (URIs) of the data files in HDFS.

When users query the external table, the Oracle Direct Connector for HDFS preprocessor uses this information to locate the data in HDFS and stream it to the database.

2.3.1 ExternalTable Command

The ExternalTable command uses the values of several properties to populate the location files. You can specify these property values in an XML document or individually on the command line.

2.3.1.1 Altering HADOOP_CLASSPATH

Before issuing the ExternalTable command, alter the HADOOP_CLASSPATH environment variable to include these JAR files:

  • $ORACLE_HOME/jdbc/lib/ojdbc6.jar: Required.

  • $ORACLE_HOME/jlib/oraclepki.jar: Required only if you are using an Oracle wallet as an external password store.

See "ExternalTable Command Example".

2.3.1.2 ExternalTable Command Syntax

This is the syntax of the ExternalTable command:

$HADOOP_HOME/bin/hadoop jar $DIRECTHDFS_HOME/jlib/orahdfs.jar oracle.hadoop.hdfs.exttab.ExternalTable [-conf config_file | -D property=value] -publish [--noexecute]
-conf config_file

Identifies the name of an XML configuration file containing the properties needed to populate the location files. See "Creating a Configuration File".

-D property=value

Assigns a value to a specific property.

--noexecute

Generates an execution plan, but does not execute any of the actions.
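For example, the following sketch generates an execution plan without publishing anything, reading the properties from a configuration file. The configuration file path shown is an assumption; see "Creating a Configuration File" for its contents:

$ $HADOOP_HOME/bin/hadoop jar \
  $DIRECTHDFS_HOME/jlib/orahdfs.jar oracle.hadoop.hdfs.exttab.ExternalTable \
  -conf /home/scott/sales_exttab_conf.xml -publish --noexecute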

2.3.1.3 ExternalTable Command Example

Example 2-2 sets the HADOOP_CLASSPATH variable and publishes HDFS data paths to the external table created in Example 2-1.

Example 2-2 Publishing HDFS Data Paths to an External Table

This example uses the Bash shell.

$  export HADOOP_CLASSPATH="$ORACLE_HOME/jdbc/lib/ojdbc6.jar:$ORACLE_HOME/jlib/oraclepki.jar"
 
$ $HADOOP_HOME/bin/hadoop jar \
  $DIRECTHDFS_HOME/jlib/orahdfs.jar oracle.hadoop.hdfs.exttab.ExternalTable  \
  -D oracle.hadoop.hdfs.exttab.tableName=SALES_HDFS_EXT_TAB \
  -D oracle.hadoop.hdfs.exttab.datasetPaths=hdfs:/user/scott/data/ \
  -D oracle.hadoop.hdfs.exttab.connection.url=jdbc:oracle:thin:@myhost:1521/orcl \
  -D oracle.hadoop.hdfs.exttab.connection.user=scott -publish
 

In this example:

  • $HADOOP_HOME is an environment variable pointing to the Hadoop home directory.

  • $DIRECTHDFS_HOME is an environment variable pointing to the Oracle Direct Connector for HDFS installation directory.

  • SALES_HDFS_EXT_TAB is the external table created in Example 2-1.

  • hdfs:/user/scott/data/ is the location of the HDFS data.

  • @myhost:1521/orcl is the database connection string.

2.3.2 How to Publish Data Pump Files

When you are publishing Data Pump files:

  • Set this property:

    oracle.hadoop.hdfs.exttab.datapumpMode=true
    
  • Ensure that the number of location files of the external table equals the number of files in the data set.

If the publish operation fails with an error message indicating that the number of files in the data set does not match the number of location files, then alter the LOCATION clause of the external table to match the number of data set files listed in the error message, and retry the publish operation; see the sketch that follows.
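The following is a minimal sketch of a Data Pump publish operation. The table name, HDFS path, and location file names are assumptions for illustration only; the command form follows Example 2-2:

$ $HADOOP_HOME/bin/hadoop jar \
  $DIRECTHDFS_HOME/jlib/orahdfs.jar oracle.hadoop.hdfs.exttab.ExternalTable \
  -D oracle.hadoop.hdfs.exttab.tableName=SALES_DP_EXT_TAB \
  -D oracle.hadoop.hdfs.exttab.datasetPaths=hdfs:/user/scott/dp_data/ \
  -D oracle.hadoop.hdfs.exttab.datapumpMode=true \
  -D oracle.hadoop.hdfs.exttab.connection.url=jdbc:oracle:thin:@myhost:1521/orcl \
  -D oracle.hadoop.hdfs.exttab.connection.user=scott -publish

If the error message reports, for example, three Data Pump files in the data set, the LOCATION clause can be altered to list three location files before retrying the publish operation:

SQL> ALTER TABLE sales_dp_ext_tab LOCATION ('sales_dp1','sales_dp2','sales_dp3');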

2.3.3 Creating a Configuration File

A configuration file is an XML document with the following simple structure:

<?xml version="1.0"?>
 <configuration>
    <property>
      <name>property</name>
      <value>value</value> 
    </property> 
        .
        .
        . 
</configuration>

See "Configuration Properties" for descriptions of these properties.

Example 2-3 shows a configuration file.

Example 2-3 Configuration File for Oracle Direct Connector for HDFS

<?xml version="1.0"?>
 <configuration>
    <property>
      <name>oracle.hadoop.hdfs.exttab.tableName</name>
      <value>SH.SALES_EXT_TAB</value> 
    </property> 
   <property> 
      <name>oracle.hadoop.hdfs.exttab.datasetPaths</name>
      <value>/data/s1/*.csv,/data/s2/*.csv</value> 
    </property>   
    <property> 
      <name>oracle.hadoop.hdfs.exttab.datasetCompressionCodec</name>                
      <value>org.apache.hadoop.io.compress.DefaultCodec</value> 
    </property>   
    <property>
      <name>oracle.hadoop.hdfs.exttab.connection.url</name>
      <value>
       jdbc:oracle:thin:@example.com:1521/example.example.com
     </value> 
    </property> 
    <property>
      <name>oracle.hadoop.hdfs.exttab.connection.user</name>
      <value>SH</value> 
    </property>      
</configuration>

2.3.4 Configuration Properties

Following are the configuration properties used by the ExternalTable command to create links to the data files in HDFS. These properties are required: oracle.hadoop.hdfs.exttab.connection.url, oracle.hadoop.hdfs.exttab.datasetPaths, and oracle.hadoop.hdfs.exttab.tableName.

2.3.4.1 Property Descriptions

oracle.hadoop.hdfs.exttab.connection.tns_admin

File path to a directory containing Oracle Net configuration files, such as sqlnet.ora and tnsnames.ora. The value of the TNS_ADMIN environment variable is used for this property by default.

Define this property to use TNS entry names in database connect strings.

oracle.hadoop.hdfs.exttab.connection.tnsEntryName

A TNS entry name defined in the tnsnames.ora file. This property is used with oracle.hadoop.hdfs.exttab.connection.tns_admin.

oracle.hadoop.hdfs.exttab.connection.url

The URL of the database connection string. This property overrides all other connection properties. The connecting database user must have the privileges described in "Granting User Access to Oracle Direct Connector for HDFS". Required.

Using a Wallet

If you are using an Oracle wallet as an external password store, then the property value must have this form:

jdbc:oracle:thin:@db_connect_string

The db_connect_string must exactly match the credential in the wallet.

This example uses Oracle Net Services syntax:

jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS_LIST=
   (ADDRESS=(PROTOCOL=TCP)(HOST=myhost)(PORT=1521)))
      (CONNECT_DATA=(SERVICE_NAME=my_db_service_name)))

The next example uses a TNSNAMES entry:

jdbc:oracle:thin:@my_tns_entry

See oracle.hadoop.hdfs.exttab.connection.wallet_location. For information about wallets, see the Oracle Database Advanced Security Administrator's Guide.

Not Using a Wallet

If you are not using an Oracle wallet, then use one of the following URL connection styles:

  • Thin Connection:

    jdbc:oracle:thin:@//myhost:1521/my_db_service_name
    
  • Oracle Net Services:

    jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS_LIST=
       (ADDRESS=(PROTOCOL=TCP)(HOST=myhost)(PORT=1521)))
          (CONNECT_DATA=(SERVICE_NAME=my_db_service_name)))
    
  • TNS Entry Name:

    jdbc:oracle:thin:@myTNSEntryName
    

oracle.hadoop.hdfs.exttab.connection.user

An Oracle Database user name. This property is required when you are not using an Oracle wallet for connections.

oracle.hadoop.hdfs.exttab.connection.wallet_location

File path to an Oracle wallet where the connection information is stored. When you are using a wallet as an external password store, set the following properties:

For a URL connection:

  • oracle.hadoop.hdfs.exttab.connection.url

  • oracle.hadoop.hdfs.exttab.connection.wallet_location

  • oracle.hadoop.hdfs.exttab.connection.tns_admin

For TNS names:

  • oracle.hadoop.hdfs.exttab.connection.tnsEntryName

  • oracle.hadoop.hdfs.exttab.connection.tns_admin

  • oracle.hadoop.hdfs.exttab.connection.wallet_location
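As an illustration only, a configuration file for a wallet-based connection through a TNS entry name might contain properties such as the following; the entry name and file paths are assumptions:

<property>
  <name>oracle.hadoop.hdfs.exttab.connection.tnsEntryName</name>
  <value>my_tns_entry</value>
</property>
<property>
  <name>oracle.hadoop.hdfs.exttab.connection.tns_admin</name>
  <value>/home/scott/network/admin</value>
</property>
<property>
  <name>oracle.hadoop.hdfs.exttab.connection.wallet_location</name>
  <value>/home/scott/wallets/hdfs_wallet</value>
</property>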

oracle.hadoop.hdfs.exttab.datapumpMode

Indicates that Data Pump files are being published. Set to true when you are publishing Data Pump files.

oracle.hadoop.hdfs.exttab.datasetCompressionCodec

The class name of the compression codec that implements the org.apache.hadoop.io.compress.CompressionCodec interface. The Decompressor class from the codec is used by the preprocessor script to decompress data for the external table. This codec applies to the entire data set.

Specify this property if the data set contains compressed files.

oracle.hadoop.hdfs.exttab.datasetPathFilter

The class name of a path filter that implements the org.apache.hadoop.fs.PathFilter interface. The paths in the data set are selected only if this filter class accepts them.

oracle.hadoop.hdfs.exttab.datasetPaths

A comma-separated list of fully qualified HDFS paths. This property enables you to restrict the input by using special pattern-matching characters in the path specification. See Table 2-1. Required.

For example, to select all files in /data/s2/, and only the CSV files in /data/s7/, /data/s8/, and /data/s9/, enter this expression:

/data/s2/,/data/s[7-9]/*.csv

The external table accesses the data contained in all listed files and all files in listed directories. These files compose a single data set.

The data set can contain compressed files or uncompressed files, but not both.

Table 2-1 Pattern-Matching Characters

Character          Description
?                  Matches any single character.
*                  Matches zero or more characters.
[abc]              Matches a single character from the character set {a, b, c}.
[a-b]              Matches a single character from the character range {a...b}. The character a must be less than or equal to b.
[^a]               Matches a single character that is not from the character set or range {a}. The caret (^) must immediately follow the left bracket.
\c                 Removes any special meaning of the character c. The backslash is the escape character.
{ab\,cd}           Matches a string from the string set {ab, cd}. Precede the comma with an escape character (\) to remove the meaning of the comma as a path separator.
{ab\,c{de\,fh}}    Matches a string from the string set {ab, cde, cfh}. Precede the comma with an escape character (\) to remove the meaning of the comma as a path separator.


oracle.hadoop.hdfs.exttab.tableName

Schema-qualified name of the external table in the format

schemaName.tableName

See "Creating an External Table for HDFS" for information about creating an external table for Oracle Direct Connector for HDFS. Required.

2.4 Querying Data in HDFS

Parallel processing is extremely important when you are working with large volumes of data. When you use external tables, always enable parallel query with this SQL command:

ALTER SESSION ENABLE PARALLEL QUERY;

Before loading data into an Oracle database from the external tables created by Oracle Direct Connector for HDFS, enable parallel DDL:

ALTER SESSION ENABLE PARALLEL DDL;

Before inserting data into an existing database table, enable parallel DML with this SQL command:

ALTER SESSION ENABLE PARALLEL DML;

Hints such as APPEND and PQ_DISTRIBUTE also improve performance when you are inserting data.
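For example, the following sketch loads the data queried from the external table of Example 2-1 into a database-resident table. The target table SALES is an assumption for illustration; it must already exist with compatible columns:

ALTER SESSION ENABLE PARALLEL QUERY;
ALTER SESSION ENABLE PARALLEL DML;

-- APPEND requests a direct-path insert into the existing SALES table.
INSERT /*+ APPEND */ INTO sales
  SELECT * FROM scott.sales_hdfs_ext_tab;
COMMIT;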