A Hive Knowledge Modules

This appendix provides information about the Hive knowledge modules.

This appendix includes the following sections:

A.1 LKM SQL to Hive SQOOP

This KM integrates data from a JDBC data source into Hive.

It executes the following steps:

  1. Create a Hive staging table.

  2. Create a SQOOP configuration file, which contains the upstream query.

  3. Execute SQOOP to extract the source data and import it into Hive.

  4. Drop the Hive staging table.

This is a direct load LKM, and any IKM assigned to the target will be ignored.
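
As context for step 2, the following is a minimal sketch of the kind of free-form query a SQOOP import typically expects; the table and column names are illustrative, and the exact statement ODI generates from the mapping is not shown here. SQOOP requires the literal $CONDITIONS token so it can append its split predicates:

SELECT EMPNO, ENAME, SAL
FROM EMPLOYEE
WHERE $CONDITIONS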

The following table describes the options for LKM SQL to Hive SQOOP.


Table A-1 LKM SQL to Hive SQOOP

Option Description

DELETE_TEMPORARY_OBJECTS

Delete temporary objects at end of mapping.

Set this option to NO, if you wish to retain temporary objects (tables, files and scripts) after integration. Useful for debugging. Default: true.

SQOOP_PARALLELISM

Number of SQOOP parallel mappers

Specifies the degree of parallelism, that is, the number of mapper processes used for extraction.

When SQOOP_PARALLELISM > 1, SPLIT_BY must be defined.

SPLIT_BY

Target column name for splitting the source data.

Specifies the unqualified target column name to be used for splitting the source data into n chunks for parallel extraction, where n is SQOOP_PARALLELISM.

To achieve equally sized data chunks the split column should contain homogeneously distributed values.

For calculating the data chunk boundaries a query similar to SELECT MIN(EMPNO), MAX(EMPNO) from EMPLOYEE EMP is used. To avoid an extra full table scan the split column should be backed by an index.
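
As an illustration, with SQOOP_PARALLELISM = 4 and boundaries MIN(EMPNO) = 1000 and MAX(EMPNO) = 2000, the range is divided into four roughly equal chunks, each read by one mapper with a predicate similar to the following (illustrative only, not the exact predicates SQOOP generates):

WHERE EMPNO >= 1000 AND EMPNO < 1250
WHERE EMPNO >= 1250 AND EMPNO < 1500
WHERE EMPNO >= 1500 AND EMPNO < 1750
WHERE EMPNO >= 1750 AND EMPNO <= 2000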

BOUNDARY_QUERY

Query to retrieve min/max value for calculating data chunks using SPLIT_BY column.

For splitting the source data into chunks for parallel extraction, the minimum and maximum values of the split column (KM option SPLIT_BY) are retrieved. In certain situations these may not be the best boundaries, or retrieving them this way may not be the most performant approach. In such cases this KM option can be set to a SQL query returning one row with two columns: the lowest and the highest value to be used for the split column. This range is divided into SQOOP_PARALLELISM chunks for parallel extraction.

Example for hard-coded ranges for an Oracle source:

SELECT 1000, 2000 FROM DUAL

To preserve context independence, regular table names should be inserted through odiRef.getObjectName calls.

For example:

SELECT MIN(EMPNO), MAX(EMPNO) FROM <%=odiRef.getObjectName("EMP")%>

TEMP_DIR

Local directory for temporary files.

Directory used for storing temporary files like the sqoop script, stdout and stderr redirects.

Leave blank to use the system's default temp directory (<?=System.getProperty("java.io.tmpdir")?>).

MAPRED_OUTPUT_BASE_DIR

MapReduce Output Directory.

This option specifies an HDFS directory where SQOOP creates subdirectories for temporary files. A subdirectory named after the work table is created here to hold the temporary data.

USE_GENERIC_JDBC_CONNECTOR

Use SQOOP's generic JDBC connector?

For certain technologies SQOOP provides specific connectors. These connectors take care of SQL-dialects and optimize performance. When there is a connector for the respective target technology, this connector should be used. If not, the generic JDBC connector may provide a solution.

EXTRA_HADOOP_CONF_PROPERTIES

Optional generic Hadoop properties.

Extra optional properties added to the Hadoop properties section of the SQOOP configuration file.

EXTRA_SQOOP_CONF_PROPERTIES

Optional SQOOP properties.

Extra optional properties added to the SQOOP properties section of the SQOOP configuration file.

EXTRA_SQOOP_CONNECTOR_CONF_PROPERTIES

Optional SQOOP connector properties.

Extra optional properties added to the SQOOP connector properties section of the SQOOP configuration file.


A.2 LKM SQL to File SQOOP Direct

This KM extracts data from a JDBC data source into an HDFS file.

It executes the following steps:

  1. Create a SQOOP configuration file, which contains the upstream query.

  2. Execute SQOOP to extract the source data and store it as an HDFS file

This is a direct load LKM and must be used without any IKM.

Note:

The entire target directory will be removed prior to extraction.

The following table describes the options for LKM SQL to File SQOOP Direct.


Table A-2 LKM SQL to File SQOOP Direct

Option Description

DELETE_TEMPORARY_OBJECTS

Delete temporary objects at end of mapping.

Set this option to NO, if you wish to retain temporary objects (tables, files and scripts) after integration. Useful for debugging. Default: true.

SQOOP_PARALLELISM

Number of SQOOP parallel mappers

Specifies the degree of parallelism, that is, the number of mapper processes used for extraction.

When SQOOP_PARALLELISM > 1, SPLIT_BY must be defined.

SPLIT_BY

Target column name for splitting the source data.

Specifies the unqualified target column name to be used for splitting the source data into n chunks for parallel extraction, where n is SQOOP_PARALLELISM.

To achieve equally sized data chunks the split column should contain homogeneously distributed values.

For calculating the data chunk boundaries a query similar to SELECT MIN(EMPNO), MAX(EMPNO) from EMPLOYEE EMP is used. To avoid an extra full table scan the split column should be backed by an index.

BOUNDARY_QUERY

Query to retrieve min/max value for calculating data chunks using SPLIT_BY column.

For splitting the source data into chunks for parallel extraction, the minimum and maximum values of the split column (KM option SPLIT_BY) are retrieved. In certain situations these may not be the best boundaries, or retrieving them this way may not be the most performant approach. In such cases this KM option can be set to a SQL query returning one row with two columns: the lowest and the highest value to be used for the split column. This range is divided into SQOOP_PARALLELISM chunks for parallel extraction.

Example for hard-coded ranges for an Oracle source:

SELECT 1000, 2000 FROM DUAL

To preserve context independence, regular table names should be inserted through odiRef.getObjectName calls.

For example:

SELECT MIN(EMPNO), MAX(EMPNO) FROM <%=odiRef.getObjectName("EMP")%>

TEMP_DIR

Local directory for temporary files.

Directory used for storing temporary files like the sqoop script, stdout and stderr redirects.

Leave blank to use the system's default temp directory (<?=System.getProperty("java.io.tmpdir")?>).

MAPRED_OUTPUT_BASE_DIR

MapReduce Output Directory.

This option specifies an HDFS directory where SQOOP creates subdirectories for temporary files. A subdirectory named after the work table is created here to hold the temporary data.

USE_GENERIC_JDBC_CONNECTOR

Use SQOOP's generic JDBC connector?

For certain technologies SQOOP provides specific connectors. These connectors take care of SQL-dialects and optimize performance. When there is a connector for the respective target technology, this connector should be used. If not, the generic JDBC connector may provide a solution.

EXTRA_HADOOP_CONF_PROPERTIES

Optional generic Hadoop properties.

Extra optional properties added to the Hadoop properties section of the SQOOP configuration file.

EXTRA_SQOOP_CONF_PROPERTIES

Optional SQOOP properties.

Extra optional properties added to the SQOOP properties section of the SQOOP configuration file.

EXTRA_SQOOP_CONNECTOR_CONF_PROPERTIES

Optional SQOOP connector properties.

Extra optional properties added to the SQOOP connector properties section of the SQOOP configuration file.


A.3 LKM SQL to HBase SQOOP Direct

This KM extracts data from a JDBC data source and imports the data into HBase.

It executes the following steps:

  1. Create a SQOOP configuration file, which contains the upstream query.

  2. Execute SQOOP to extract the source data and import it into HBase.

This is a direct load LKM and must be used without any IKM.

The following table describes the options for LKM SQL to HBase SQOOP Direct.


Table A-3 LKM SQL to HBase SQOOP Direct

Option Description

CREATE_TARG_TABLE

Create target table?

Check this option, if you wish to create the target table.

TRUNCATE

Replace existing target data?

Set this option to true, if you wish to replace any existing target table content with the new data.

DELETE_TEMPORARY_OBJECTS

Delete temporary objects at end of mapping.

Set this option to NO, if you wish to retain temporary objects (tables, files and scripts) after integration. Useful for debugging. Default: true.

SQOOP_PARALLELISM

Number of SQOOP parallel mappers

Specifies the degree of parallelism, that is, the number of mapper processes used for extraction.

When SQOOP_PARALLELISM > 1, SPLIT_BY must be defined.

SPLIT_BY

Target column name for splitting the source data.

Specifies the unqualified target column name to be used for splitting the source data into n chunks for parallel extraction, where n is SQOOP_PARALLELISM.

To achieve equally sized data chunks the split column should contain homogeneously distributed values.

For calculating the data chunk boundaries a query similar to SELECT MIN(EMPNO), MAX(EMPNO) from EMPLOYEE EMP is used. To avoid an extra full table scan the split column should be backed by an index.

BOUNDARY_QUERY

Query to retrieve min/max value for calculating data chunks using SPLIT_BY column.

For splitting the source data into chunks for parallel extraction, the minimum and maximum values of the split column (KM option SPLIT_BY) are retrieved. In certain situations these may not be the best boundaries, or retrieving them this way may not be the most performant approach. In such cases this KM option can be set to a SQL query returning one row with two columns: the lowest and the highest value to be used for the split column. This range is divided into SQOOP_PARALLELISM chunks for parallel extraction.

Example for hard-coded ranges for an Oracle source:

SELECT 1000, 2000 FROM DUAL

To preserve context independence, regular table names should be inserted through odiRef.getObjectName calls.

For example:

SELECT MIN(EMPNO), MAX(EMPNO) FROM <%=odiRef.getObjectName("EMP")%>

TEMP_DIR

Local directory for temporary files.

Directory used for storing temporary files like the sqoop script, stdout and stderr redirects.

Leave blank to use the system's default temp directory (<?=System.getProperty("java.io.tmpdir")?>).

MAPRED_OUTPUT_BASE_DIR

MapReduce Output Directory.

This option specifies an HDFS directory where SQOOP creates subdirectories for temporary files. A subdirectory named after the work table is created here to hold the temporary data.

USE_GENERIC_JDBC_CONNECTOR

Use SQOOP's generic JDBC connector?

For certain technologies SQOOP provides specific connectors. These connectors take care of SQL-dialects and optimize performance. When there is a connector for the respective target technology, this connector should be used. If not, the generic JDBC connector may provide a solution.

EXTRA_HADOOP_CONF_PROPERTIES

Optional generic Hadoop properties.

Extra optional properties added to the Hadoop properties section of the SQOOP configuration file.

EXTRA_SQOOP_CONF_PROPERTIES

Optional SQOOP properties.

Extra optional properties added to the SQOOP properties section of the SQOOP configuration file.

EXTRA_SQOOP_CONNECTOR_CONF_PROPERTIES

Optional SQOOP connector properties.

Extra optional properties added to the SQOOP connector properties section of the SQOOP configuration file.


A.4 LKM File to SQL SQOOP

This KM integrates data from HDFS files into a JDBC target.

It executes the following steps:

  1. Create a SQOOP configuration file

  2. Load data using SQOOP into a work table on RDBMS

  3. Drop the work table.

The following table describes the options for LKM File to SQL SQOOP.


Table A-4 LKM File to SQL SQOOP

Option Description

SQOOP_PARALLELISM

Number of SQOOP parallel mappers.

Specifies the degree of parallelism, that is, the number of mapper processes used for extraction.

When SQOOP_PARALLELISM > 1, SPLIT_BY must be defined.

WORK_TABLE_OPTIONS

Work table options.

Use this option if you wish to override the standard technology-specific work table options. When left blank, the following default values are used:

Oracle: NOLOGGING

DB2 UDB: NOT LOGGED INITIALLY

Teradata: no fallback, no before journal, no after journal

TERADATA_WORK_TABLE_TYPE

Teradata work table type.

Use SET or MULTISET table for work table.

TERADATA_OUTPUT_METHOD

Teradata Load Method.

Specifies the way the Teradata Connector will load the data. Valid values are:

  • batch.insert: multiple JDBC connections using batched prepared statements (simplest to start with)

  • multiple.fastload: multiple FastLoad connections

  • internal.fastload: single coordinated FastLoad connections (most performant)

Please see Cloudera's Teradata Connectors User Guide for more details.

TEMP_DIR

Local directory for temporary files.

Directory used for storing temporary files like the sqoop script, stdout and stderr redirects.

Leave blank to use the system's default temp directory (<?=System.getProperty("java.io.tmpdir")?>).

MAPRED_OUTPUT_BASE_DIR

MapReduce Output Directory.

This option specifies an HDFS directory where SQOOP creates subdirectories for temporary files. A subdirectory named after the work table is created here to hold the temporary data.

USE_GENERIC_JDBC_CONNECTOR

Use SQOOP's generic JDBC connector?

For certain technologies SQOOP provides specific connectors. These connectors take care of SQL-dialects and optimize performance. When there is a connector for the respective target technology, this connector should be used. If not, the generic JDBC connector may provide a solution.

EXTRA_HADOOP_CONF_PROPERTIES

Optional generic Hadoop properties.

Extra optional properties added to the Hadoop properties section of the SQOOP configuration file.

EXTRA_SQOOP_CONF_PROPERTIES

Optional SQOOP properties.

Extra optional properties added to the SQOOP properties section of the SQOOP configuration file.

EXTRA_SQOOP_CONNECTOR_CONF_PROPERTIES

Optional SQOOP connector properties.

Extra optional properties added to the SQOOP connector properties section of the SQOOP configuration file.


A.5 LKM Hive to SQL SQOOP

This KM integrates data from Hive into a JDBC target.

It executes the following steps:

  1. Unload data into HDFS

  2. Create a SQOOP configuration file

  3. Load data using SQOOP into a work table on RDBMS

  4. Drop the work table

The following table describes the options for LKM Hive to SQL SQOOP.


Table A-5 LKM Hive to SQL SQOOP

Option Description

DELETE_TEMPORARY_OBJECTS

Delete temporary objects at end of mapping.

Set this option to NO, if you wish to retain temporary objects (tables, files and scripts) after integration. Useful for debugging.

SQOOP_PARALLELISM

Number of SQOOP parallel mappers.

Specifies the degree of parallelism, that is, the number of mapper processes used for extraction.

When SQOOP_PARALLELISM > 1, SPLIT_BY must be defined.

WORK_TABLE_OPTIONS

Work table options.

Use this option if you wish to override the standard technology-specific work table options. When left blank, the following default values are used:

Oracle: NOLOGGING

DB2 UDB: NOT LOGGED INITIALLY

Teradata: no fallback, no before journal, no after journal

TERADATA_WORK_TABLE_TYPE

Teradata work table type.

Use SET or MULTISET table for work table.

TERADATA_OUTPUT_METHOD

Teradata Load Method.

Specifies the way the Teradata Connector will load the data. Valid values are:

  • batch.insert: multiple JDBC connections using batched prepared statements (simplest to start with)

  • multiple.fastload: multiple FastLoad connections

  • internal.fastload: single coordinated FastLoad connections (most performant)

Please see Cloudera's Teradata Connectors User Guide for more details.

TEMP_DIR

Local directory for temporary files.

Directory used for storing temporary files like the sqoop script, stdout and stderr redirects.

Leave blank to use the system's default temp directory (<?=System.getProperty("java.io.tmpdir")?>).

MAPRED_OUTPUT_BASE_DIR

MapReduce Output Directory.

This option specifies an HDFS directory where SQOOP creates subdirectories for temporary files. A subdirectory named after the work table is created here to hold the temporary data.

USE_GENERIC_JDBC_CONNECTOR

Use SQOOP's generic JDBC connector?

For certain technologies SQOOP provides specific connectors. These connectors take care of SQL-dialects and optimize performance. When there is a connector for the respective target technology, this connector should be used. If not, the generic JDBC connector may provide a solution.

EXTRA_HADOOP_CONF_PROPERTIES

Optional generic Hadoop properties.

Extra optional properties added to the Hadoop properties section of the SQOOP configuration file.

EXTRA_SQOOP_CONF_PROPERTIES

Optional SQOOP properties.

Extra optional properties added to the SQOOP properties section of the SQOOP configuration file.

EXTRA_SQOOP_CONNECTOR_CONF_PROPERTIES

Optional SQOOP connector properties.

Extra optional properties added to the SQOOP connector properties section of the SQOOP configuration file.


A.6 LKM HBase to SQL SQOOP

This KM integrates data from HBase into a JDBC target.

It executes the following steps:

  1. Create a SQOOP configuration file

  2. Create a Hive table definition for the HBase table

  3. Unload data from Hive (HBase) using SQOOP into a work table on RDBMS

  4. Drop the work table.

The following table describes the options for LKM HBase to SQL SQOOP.


Table A-6 LKM HBase to SQL SQOOP

Option Description

DELETE_TEMPORARY_OBJECTS

Delete temporary objects at end of mapping.

Set this option to NO, if you wish to retain temporary objects (tables, files and scripts) after integration. Useful for debugging. Default: true.

HIVE_STAGING_LSCHEMA

Logical schema name for Hive-HBase-SerDe table.

The unloading of HBase data is done via Hive. This KM option defines the Hive database that will be used for creating the Hive HBase-SerDe table for unloading the HBase data.

SQOOP_PARALLELISM

Number of SQOOP parallel mappers.

Specifies the degree of parallelism, that is, the number of mapper processes used for extraction.

When SQOOP_PARALLELISM > 1, SPLIT_BY must be defined.

WORK_TABLE_OPTIONS

Work table options.

Use this option if you wish to override the standard technology-specific work table options. When left blank, the following default values are used:

Oracle: NOLOGGING

DB2 UDB: NOT LOGGED INITIALLY

Teradata: no fallback, no before journal, no after journal

TERADATA_WORK_TABLE_TYPE

Teradata work table type.

Use SET or MULTISET table for work table.

TERADATA_OUTPUT_METHOD

Teradata Load Method.

Specifies the way the Teradata Connector will load the data. Valid values are:

  • batch.insert: multiple JDBC connections using batched prepared statements (simplest to start with)

  • multiple.fastload: multiple FastLoad connections

  • internal.fastload: single coordinated FastLoad connections (most performant)

Please see Cloudera's Teradata Connectors User Guide for more details.

TEMP_DIR

Local directory for temporary files.

Directory used for storing temporary files like the sqoop script, stdout and stderr redirects.

Leave blank to use the system's default temp directory (<?=System.getProperty("java.io.tmpdir")?>).

MAPRED_OUTPUT_BASE_DIR

MapReduce Output Directory.

This option specifies an HDFS directory where SQOOP creates subdirectories for temporary files. A subdirectory named after the work table is created here to hold the temporary data.

USE_GENERIC_JDBC_CONNECTOR

Use SQOOP's generic JDBC connector?

For certain technologies SQOOP provides specific connectors. These connectors take care of SQL-dialects and optimize performance. When there is a connector for the respective target technology, this connector should be used. If not, the generic JDBC connector may provide a solution.

EXTRA_HADOOP_CONF_PROPERTIES

Optional generic Hadoop properties.

Extra optional properties added to the Hadoop properties section of the SQOOP configuration file.

EXTRA_SQOOP_CONF_PROPERTIES

Optional SQOOP properties.

Extra optional properties added to the SQOOP properties section of the SQOOP configuration file.

EXTRA_SQOOP_CONNECTOR_CONF_PROPERTIES

Optional SQOOP connector properties.

Extra optional properties added to the SQOOP connector properties section of the SQOOP configuration file.


A.7 LKM HDFS File to Hive

This KM loads data only from HDFS files into Hive. The files can be in JSON, Avro, Parquet, or delimited format, including files with complex data.


Table A-7 LKM HDFS File to Hive

Option Description

STOP_ON_FILE_NOT_FOUND

This checkbox option defines whether the KM should stop, if no input file is found.

OVERRIDE_ROW_FORMAT

This option allows you to override the entire Hive row format definition of the staging table or the target table.
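
For illustration, a simple override for a tab-delimited file might look like the following (an assumed example, not taken from the KM):

ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'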

DELETE_TEMPORARY_OBJECTS

Set this option to No, if you want to retain the temporary objects (tables, files and scripts) after integration.


A.8 LKM HDFS File to Hive (Direct)

This KM loads data only from HDFS files, directly into the Hive target table, bypassing the staging table for better performance.


Table A-8 LKM HDFS File to Hive (Direct)

Option Description

STOP_ON_FILE_NOT_FOUND

This checkbox option defines whether the KM should stop, if no input file is found.

OVERRIDE_ROW_FORMAT

This option allows you to override the entire Hive row format definition of the staging table or the target table.

DELETE_TEMPORARY_OBJECTS

Set this option to No, if you want to retain the temporary objects (tables, files and scripts) after integration.

CREATE_TARG_TABLE

Create target table?

Check this option, if you wish to create the target table.

TRUNCATE

Replace existing target data?

Set this option to true, if you wish to replace any existing target table content with the new data.


A.9 IKM Hive Append

This KM integrates data into a Hive target table in append or replace (truncate) mode.
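
Conceptually, the two modes correspond to Hive's two insert forms, sketched below with hypothetical table names; this illustrates the semantics only and is not the code the KM generates:

-- Append mode: new rows are added to the existing table content
INSERT INTO TABLE target_table SELECT * FROM source_data;

-- Replace (truncate) mode: existing content is overwritten
INSERT OVERWRITE TABLE target_table SELECT * FROM source_data;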

The following table describes the options for IKM Hive Append.


Table A-9 IKM Hive Append

Option Description

CREATE_TARG_TABLE

Create target table.

Check this option if you wish to create the target table.

TRUNCATE

Replace all target table data.

Set this option to true, if you wish to replace the target table content with the new data.


Note:

If the target Hive table has a column containing a complex type, it must not be left unmapped. Hive does not allow setting null values for complex columns.

A.10 IKM Hive Incremental Update

This IKM integrates data incrementally into a Hive target table. The KM should be assigned to the Hive target node.

The target data store's integration type must be set to Incremental Update for this KM to appear in the list of available KMs for assignment.
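
As a conceptual illustration only (not the statements this KM generates), one common HiveQL pattern for an incremental update reconciles existing rows with new or changed rows through a full outer join on the key; all table and column names below are hypothetical:

-- Keep the incoming value when present, otherwise the existing one
INSERT OVERWRITE TABLE emp_reconciled
SELECT
  COALESCE(s.empno, t.empno) AS empno,
  COALESCE(s.ename, t.ename) AS ename,
  COALESCE(s.sal,   t.sal)   AS sal
FROM emp_target t
FULL OUTER JOIN emp_updates s ON t.empno = s.empno;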


Table A-10 IKM Hive Incremental Update

Option Description

CREATE_TARG_TABLE

Create target table.

Check this option if you wish to create the target table.

TRUNCATE

Replace all target table data.

Set this option to true, if you wish to replace the target table content with the new data.


A.11 LKM File to Hive LOAD DATA

Integration from a flat file staging area to Hive using Hive's LOAD DATA command.

This KM executes the following steps:

  1. Create a flow table in Hive

  2. Declare data files to Hive (LOAD DATA command)

  3. Load data from Hive staging table into target table

The KM can handle filename wildcards (*, ?).
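
As an illustration of steps 2 and 3, the Hive statements involved typically resemble the following; the file path and table names are hypothetical, and the actual statements are generated by the KM:

-- Step 2: declare the data file to the Hive staging table
LOAD DATA INPATH '/user/odi/incoming/webshop_01.log' INTO TABLE c$_webshop_stage;
-- (use LOAD DATA LOCAL INPATH ... when the file is outside the cluster)

-- Step 3: load from the staging table into the target table
INSERT INTO TABLE webshop_target SELECT * FROM c$_webshop_stage;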

The following table describes the options for LKM File to Hive LOAD DATA.


Table A-11 LKM File to Hive LOAD DATA

Option Description

DELETE_TEMPORARY_OBJECTS

Delete temporary objects at end of mapping.

Set this option to NO, if you wish to retain temporary objects (tables, files and scripts) after integration. Useful for debugging.

EXTERNAL_TABLE

Preserve file in original location?

Defines whether to declare the target/staging table as externally managed.

Default: false

For non-external tables Hive manages all data files. That is, it will *move* any data files into <hive.metastore.warehouse.dir>/<table_name>. For external tables Hive does not move or delete any files. It will load data from the location given by the ODI schema.

If EXTERNAL_TABLE is set to true:

All files in the directory given by the physical data schema will be loaded. So any filename or wildcard information from the source DataStore's resource name will be ignored.

The directory structure and file names must comply with Hive's directory organization for tables, e.g. for partitioning and clustering.

The directory and its files must reside in HDFS.

No Hive LOAD-DATA-statements are submitted and thus loading of files to a specific partition (using a target-side expression) is not possible.

FILE_IS_LOCAL

Is this a local file?

Defines whether the source file is to be considered local (= outside of the current Hadoop cluster).

Default: true

If FILE_IS_LOCAL is set to true, the data file(s) are copied into the Hadoop cluster first.

If FILE_IS_LOCAL is set to false, the data file(s) are moved into the Hadoop cluster and therefore will no longer be available at their source location. If the source file is already in HDFS, FILE_IS_LOCAL=false results in just a file rename, and is therefore a very fast operation. This option only applies if EXTERNAL_TABLE is set to false.

STOP_ON_FILE_NOT_FOUND

Stop if no input file was found?

This checkbox option defines whether the KM should stop, if no input file has been found.

OVERRIDE_ROW_FORMAT

Custom row format clause.

This option allows you to override the entire Hive row format definition of the staging table (in case USE_STAGE_TABLE is set to true) or the target table (in case USE_STAGE_TABLE is set to false). It contains the text to be used for the row format definition.

Example for reading Apache Combined WebLog files:

ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' <EOL>WITH SERDEPROPERTIES ( <EOL> "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*) (\".*?\") (\".*?\") (\".*?\")" )


A.12 LKM File to Hive LOAD DATA Direct

Direct integration from a flat file into Hive without any staging using Hive's LOAD DATA command.

This is a direct load LKM and must be used without any IKM.

The KM can handle filename wildcards (*, ?).

The following table describes the options for LKM File to Hive LOAD DATA Direct.


Table A-12 LKM File to Hive LOAD DATA Direct

Option Description

CREATE_TARG_TABLE

Create target table.

Check this option if you wish to create the target table.

TRUNCATE

Replace all target table data.

Set this option to true, if you wish to replace the target table content with the new data.

DELETE_TEMPORARY_OBJECTS

Delete temporary objects at end of mapping.

Set this option to NO, if you wish to retain temporary objects (tables, files and scripts) after integration. Useful for debugging.

EXTERNAL_TABLE

Preserve file in original location?

Defines whether to declare the target/staging table as externally managed.

Default: false

For non-external tables Hive manages all data files. That is, it will *move* any data files into <hive.metastore.warehouse.dir>/<table_name>. For external tables Hive does not move or delete any files. It will load data from the location given by the ODI schema.

If EXTERNAL_TABLE is set to true:

All files in the directory given by the physical data schema will be loaded. So any filename or wildcard information from the source DataStore's resource name will be ignored.

The directory structure and file names must comply with Hive's directory organization for tables, e.g. for partitioning and clustering.

The directory and its files must reside in HDFS.

No Hive LOAD-DATA-statements are submitted and thus loading of files to a specific partition (using a target-side expression) is not possible.

FILE_IS_LOCAL

Is this a local file?

Defines whether the source file is to be considered local (= outside of the current Hadoop cluster).

Default: true

If FILE_IS_LOCAL is set to true, the data file(s) are copied into the Hadoop cluster first.

If FILE_IS_LOCAL is set to false, the data file(s) are moved into the Hadoop cluster and therefore will no longer be available at their source location. If the source file is already in HDFS, FILE_IS_LOCAL=false results in just a file rename, and is therefore a very fast operation. This option only applies if EXTERNAL_TABLE is set to false.

STOP_ON_FILE_NOT_FOUND

Stop if no input file was found?

This checkbox option defines whether the KM should stop, if no input file has been found.

OVERRIDE_ROW_FORMAT

Custom row format clause.

This option allows you to override the entire Hive row format definition of the staging table (in case USE_STAGE_TABLE is set to true) or the target table (in case USE_STAGE_TABLE is set to false). It contains the text to be used for the row format definition.

Example for reading Apache Combined WebLog files:

ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' <EOL>WITH SERDEPROPERTIES ( <EOL> "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*) (\".*?\") (\".*?\") (\".*?\")" )


A.13 LKM HBase to Hive HBASE-SERDE

This LKM provides read access to an HBase table from Hive.

This is achieved by defining a temporary Hive load table definition that represents all relevant columns of the HBase source table.
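
A minimal sketch of such a Hive-on-HBase table definition is shown below; the table name, columns, and column-family mapping are illustrative, and the KM generates its own temporary definition:

CREATE EXTERNAL TABLE hbase_emp_hive (key STRING, ename STRING, sal STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:ename,cf1:sal")
TBLPROPERTIES ("hbase.table.name" = "EMP");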

A.14 LKM Hive to HBase Incremental Update HBASE-SERDE Direct

This LKM loads data from Hive into HBase and supports inserting new rows as well as updating existing data.

This is a direct load LKM and must be used without any IKM.

The following table describes the options for LKM Hive to HBase Incremental Update HBASE-SERDE Direct.


Table A-13 LKM Hive to HBase Incremental Update HBASE-SERDE Direct

Option Description

CREATE_TARG_TABLE

Create target table.

Check this option if you wish to create the target table.

TRUNCATE

Replace all target table data.

Set this option to true, if you wish to replace the target table content with the new data.

HBASE_WAL

Disable Write-Ahead-Log.

HBase uses a Write-Ahead-Log to protect against data loss. For better performance, WAL can be disabled. Please note that this setting applies to all Hive commands executed later in this session.

DELETE_TEMPORARY_OBJECTS

Delete temporary objects at end of mapping.

Set this option to NO, if you wish to retain temporary objects (tables, files and scripts) after integration. Useful for debugging.


A.15 LKM Hive to File Direct

This LKM unloads data from Hive into flat files.

This is a direct load LKM and must be used without any IKM.
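
Conceptually, the unload resembles a Hive directory export such as the following sketch; the path and query are illustrative, and the statement actually issued is generated by the KM and honors the FILE_IS_LOCAL and STORED_AS options described below:

-- Unload query results into an HDFS directory (add LOCAL after OVERWRITE for a local target directory)
INSERT OVERWRITE DIRECTORY '/user/odi/export/emp'
SELECT empno, ename, sal FROM emp;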

The following table describes the options for LKM Hive to File Direct.


Table A-14 LKM Hive to File Direct

Option Description

FILE_IS_LOCAL

Is this a local file?

Defines whether the target file is to be considered local (outside of the current Hadoop cluster).

STORED_AS

File format.

Defines whether the target file is to be stored as plain text file (TEXTFILE) or compressed (SEQUENCEFILE).


A.16 XKM Hive Sort

This XKM sorts data using an expression.

The following table describes the options for XKM Hive Sort.


Table A-15 XKM Hive Sort

Option Description

SORT_MODE

Select the mode the SORT operator will generate code for.


A.17 LKM File to Oracle OLH-OSCH

This KM integrates data from an HDFS file into an Oracle staging table using Oracle Loader for Hadoop (OLH) and/or Oracle SQL Connector for Hadoop (OSCH).

The KM can handle filename wildcards (*, ?).

The following table describes the options for LKM File to Oracle OLH-OSCH.


Table A-16 LKM File to Oracle OLH-OSCH

Option Description

DELETE_TEMPORARY_OBJECTS

Delete temporary objects at end of mapping.

Set this option to NO, if you wish to retain temporary objects (tables, files and scripts) after integration. Useful for debugging.

OLH_OUTPUT_MODE

How to transfer data into Oracle?

This option specifies how to load the Hadoop data into Oracle. Permitted values are JDBC, OCI, DP_COPY|DP_OSCH, and OSCH.

  • JDBC output mode: The data is inserted using a number of direct insert JDBC connections.

    In very rare cases JDBC mode may result in duplicate records in target table due to Hadoop trying to restart tasks.

  • OCI output mode: The data is inserted using a number of direct insert OCI connections in direct path mode.

    For direct loading (no C$ table), the target table must be partitioned. For standard loading, FLOW_TABLE_OPTIONS must explicitly specify partitioning, e.g. PARTITION BY HASH(COL1) PARTITIONS 4.

    In very rare cases OCI mode may result in duplicate records in target table due to Hadoop trying to restart tasks.

  • DP_COPY output mode: OLH creates a number of DataPump export files. These files are transferred by a "hadoop fs -copyToLocal" command to the local path specified by EXT_TAB_DIR_LOCATION. Please note that the path must be accessible by the Oracle Database engine. Once the copy job is complete, the files are loaded into the target using an external table.

REJECT_LIMIT

Max number of errors for OLH/EXTTAB.

Enter the maximum number of errors allowed in the file. Examples: UNLIMITED to accept all errors, or an integer value (for example, 10 to allow 10 rejections).

This value is used in OLH job definitions as well as in external table definitions.

EXT_TAB_DIR_LOCATION

Directory for ext tab data files.

File system path of the external table.

Note:

  • Only applicable, if OLH_OUTPUT_MODE = DP_* or OSCH

  • For OLH_OUTPUT_MODE = DP_*: this path must be accessible both from the ODI agent and from the target database engine.

  • For OLH_OUTPUT_MODE = DP_*: the name of the external directory object is the I$ table name.

  • For OLH_OUTPUT_MODE = DP_COPY: ODI agent will use hadoop-fs command to copy dp files into this directory.

  • For OLH_OUTPUT_MODE = DP_*|OSCH: this path will contain any external table log/bad/dsc files.

  • ODI agent will remove any files from this directory during clean up before launching OLH/OSCH.

WORK_TABLE_OPTIONS

Option for Flow table creation.

Use this option to specify the attributes for the integration table at create time; these attributes can be used to increase performance.

This option is set by default to NOLOGGING.

This option may be left empty.

OVERRIDE_INPUTFORMAT

Class name of InputFormat.

By default the InputFormat class is derived from the source DataStore/Technology (DelimitedTextInputFormat or HiveToAvroInputFormat). This option allows the user to specify the class name of a custom InputFormat.

Default: <empty>.

Cannot be used with OLH_OUTPUT_MODE=OSCH.

For example, for reading custom file formats like web log files the OLH RegexInputFormat can be used by assigning the value: oracle.hadoop.loader.lib.input.RegexInputFormat

See KM option EXTRA_OLH_CONF_PROPERTIES for details on how to specify the regular expression.

EXTRA_OLH_CONF_PROPERTIES

Optional extra OLH properties.

Allows adding extra parameters to OLH. E.g. for changing the default OLH date format:

<property>

<name>oracle.hadoop.loader.defaultDateFormat</name>

<value>yyyy-MM-dd HH:mm:ss</value>

</property>

Particularly when using custom InputFormats (see KM option OVERRIDE_INPUTFORMAT for details) the InputFormat may require additional configuration parameters. These are provided in the OLH configuration file. This KM option allows adding extra properties to the OLH configuration file. Default: <empty>

Cannot be used with OLH_OUTPUT_MODE=OSCH

Example (loading apache weblog file format):

When OLH RegexInputFormat is used for reading custom file formats, this KM option specifies the regular expression and other parsing details:

<property>

<name>oracle.hadoop.loader.input.regexPattern</name>

<value>([^ ]*) ([^ ]*) ([^ ]*) (-|\[[^\]]*\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*) (\".*?\") (\".*?\") (\".*?\")</value>

<description>RegEx for Apache WebLog format</description>

</property>

MAPRED_OUTPUT_BASE_DIR

MapReduce Output Directory.

This option specifies an HDFS directory where SQOOP creates subdirectories for temporary files. A subdirectory named after the work table is created here to hold the temporary data.

TEMP_DIR

Local directory for temporary files.

Directory used for storing temporary files like the sqoop script, stdout and stderr redirects.

Leave blank to use the system's default temp directory (<?=System.getProperty("java.io.tmpdir")?>).


A.18 LKM File to Oracle OLH-OSCH Direct

This KM integrates data from an HDFS file into an Oracle target using Oracle Loader for Hadoop (OLH) and/or Oracle SQL Connector for Hadoop (OSCH).

The KM can handle filename wildcards (*, ?).

This is a direct load LKM (no staging) and must be used without any IKM.

The following table describes the options for LKM File to Oracle OLH-OSCH Direct.


Table A-17 LKM File to Oracle OLH-OSCH Direct

Option Description

CREATE_TARG_TABLE

Create target table.

Check this option if you wish to create the target table.

TRUNCATE

Replace all target table data.

Set this option to true, if you wish to replace the target table content with the new data.

DELETE_ALL

Delete all rows.

Set this option to true, if you wish to delete all rows from the target table before loading the new data.

DELETE_TEMPORARY_OBJECTS

Delete temporary objects at end of mapping.

Set this option to NO, if you wish to retain temporary objects (tables, files and scripts) after integration. Useful for debugging.

OLH_OUTPUT_MODE

How to transfer data into Oracle?

This option specifies how to load the Hadoop data into Oracle. Permitted values are JDBC, OCI, DP_COPY|DP_OSCH, and OSCH.

  • JDBC output mode: The data is inserted using a number of direct insert JDBC connections.

    In very rare cases JDBC mode may result in duplicate records in target table due to Hadoop trying to restart tasks.

  • OCI output mode: The data is inserted using a number of direct insert OCI connections in direct path mode.

    For direct loading (no C$ table), the target table must be partitioned. For standard loading, FLOW_TABLE_OPTIONS must explicitly specify partitioning, e.g. PARTITION BY HASH(COL1) PARTITIONS 4.

    In very rare cases OCI mode may result in duplicate records in target table due to Hadoop trying to restart tasks.

  • DP_COPY output mode: OLH creates a number of DataPump export files. These files are transferred by a "hadoop fs -copyToLocal" command to the local path specified by EXT_TAB_DIR_LOCATION. Please note that the path must be accessible by the Oracle Database engine. Once the copy job is complete, the files are loaded into the target using an external table.

REJECT_LIMIT

Max number of errors for OLH/EXTTAB.

Enter the maximum number of errors allowed in the file. Examples: UNLIMITED to accept all errors, or an integer value (for example, 10 to allow 10 rejections).

This value is used in OLH job definitions as well as in external table definitions.

EXT_TAB_DIR_LOCATION

Directory for ext tab data files.

File system path of the external table.

Note:

  • Only applicable, if OLH_OUTPUT_MODE = DP_* or OSCH

  • For OLH_OUTPUT_MODE = DP_*: this path must be accessible both from the ODI agent and from the target database engine.

  • For OLH_OUTPUT_MODE = DP_*: the name of the external directory object is the I$ table name.

  • For OLH_OUTPUT_MODE = DP_COPY: ODI agent will use hadoop-fs command to copy dp files into this directory.

  • For OLH_OUTPUT_MODE = DP_*|OSCH: this path will contain any external table log/bad/dsc files.

  • ODI agent will remove any files from this directory during clean up before launching OLH/OSCH.

WORK_TABLE_OPTIONS

Option for Flow table creation.

Use this option to specify the attributes for the integration table at create time; these attributes can be used to increase performance.

This option is set by default to NOLOGGING.

This option may be left empty.

OVERRIDE_INPUTFORMAT

Class name of InputFormat.

By default the InputFormat class is derived from the source DataStore/Technology (DelimitedTextInputFormat or HiveToAvroInputFormat). This option allows the user to specify the class name of a custom InputFormat.

Default: <empty>.

Cannot be used with OLH_OUTPUT_MODE=OSCH.

For example, for reading custom file formats like web log files the OLH RegexInputFormat can be used by assigning the value: oracle.hadoop.loader.lib.input.RegexInputFormat

See KM option EXTRA_OLH_CONF_PROPERTIES for details on how to specify the regular expression.

EXTRA_OLH_CONF_PROPERTIES

Optional extra OLH properties.

Allows adding extra parameters to OLH. E.g. for changing the default OLH date format:

<property>

<name>oracle.hadoop.loader.defaultDateFormat</name>

<value>yyyy-MM-dd HH:mm:ss</value>

</property>

Particularly when using custom InputFormats (see KM option OVERRIDE_INPUTFORMAT for details) the InputFormat may require additional configuration parameters. These are provided in the OLH configuration file. This KM option allows adding extra properties to the OLH configuration file. Default: <empty>

Cannot be used with OLH_OUTPUT_MODE=OSCH

Example (loading apache weblog file format):

When OLH RegexInputFormat is used for reading custom file formats, this KM option specifies the regular expression and other parsing details:

<property>

<name>oracle.hadoop.loader.input.regexPattern</name>

<value>([^ ]*) ([^ ]*) ([^ ]*) (-|\[[^\]]*\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*) (\".*?\") (\".*?\") (\".*?\")</value>

<description>RegEx for Apache WebLog format</description>

</property>

MAPRED_OUTPUT_BASE_DIR

MapReduce Output Directory.

This option specifies an HDFS directory where SQOOP creates subdirectories for temporary files. A subdirectory named after the work table is created here to hold the temporary data.

TEMP_DIR

Local directory for temporary files.

Directory used for storing temporary files like the sqoop script, stdout and stderr redirects.

Leave blank to use the system's default temp directory (<?=System.getProperty("java.io.tmpdir")?>).


A.19 LKM Hive to Oracle OLH-OSCH

This KM integrates data from a Hive query into an Oracle staging table using Oracle Loader for Hadoop (OLH) and/or Oracle SQL Connector for Hadoop (OSCH).

The following table describes the options for LKM Hive to Oracle OLH-OSCH.


Table A-18 LKM Hive to Oracle OLH-OSCH

Option Description

USE_HIVE_STAGING_TABLE

Use intermediate Hive staging table?

By default the Hive source data is materialized in a Hive staging table prior to extraction by OLH. If USE_HIVE_STAGING_TABLE is set to false, OLH accesses the Hive source data directly.

USE_HIVE_STAGING_TABLE=0 is only possible if all of these conditions are true:

  • Only a single source table

  • No transformations, filters, joins.

  • No datasets

USE_HIVE_STAGING_TABLE=0 provides better performance by avoiding an extra data transfer step.

DELETE_TEMPORARY_OBJECTS

Delete temporary objects at end of mapping.

Set this option to NO, if you wish to retain temporary objects (tables, files and scripts) after integration. Useful for debugging.

OLH_OUTPUT_MODE

How to transfer data into Oracle?

This option specifies how to load the Hadoop data into Oracle. Permitted values are JDBC, OCI, DP_COPY|DP_OSCH, and OSCH.

  • JDBC output mode: The data is inserted using a number of direct insert JDBC connections.

    In very rare cases JDBC mode may result in duplicate records in target table due to Hadoop trying to restart tasks.

  • OCI output mode: The data is inserted using a number of direct insert OCI connections in direct path mode.

    For direct loading (no C$ table), the target table must be partitioned. For standard loading, FLOW_TABLE_OPTIONS must explicitly specify partitioning, e.g. PARTITION BY HASH(COL1) PARTITIONS 4.

    In very rare cases OCI mode may result in duplicate records in target table due to Hadoop trying to restart tasks.

  • DP_COPY output mode: OLH creates a number of DataPump export files. These files are transferred by a "hadoop fs -copyToLocal" command to the local path specified by EXT_TAB_DIR_LOCATION. Please note that the path must be accessible by the Oracle Database engine. Once the copy job is complete, the files are loaded into the target using an external table.

REJECT_LIMIT

Max number of errors for OLH/EXTTAB.

Enter the maximum number of errors allowed in the file. Examples: UNLIMITED to accept all errors, or an integer value (for example, 10 to allow 10 rejections).

This value is used in OLH job definitions as well as in external table definitions.

EXT_TAB_DIR_LOCATION

Directory for ext tab data files.

File system path of the external table.

Note:

  • Only applicable, if OLH_OUTPUT_MODE = DP_* or OSCH

  • For OLH_OUTPUT_MODE = DP_*: this path must be accessible both from the ODI agent and from the target database engine.

  • For OLH_OUTPUT_MODE = DP_*: the name of the external directory object is the I$ table name.

  • For OLH_OUTPUT_MODE = DP_COPY: ODI agent will use hadoop-fs command to copy dp files into this directory.

  • For OLH_OUTPUT_MODE = DP_*|OSCH: this path will contain any external table log/bad/dsc files.

  • ODI agent will remove any files from this directory during clean up before launching OLH/OSCH.

WORK_TABLE_OPTIONS

Option for Flow table creation.

Use this option to specify the attributes for the integration table at create time; these attributes can be used to increase performance.

This option is set by default to NOLOGGING.

This option may be left empty.

OVERRIDE_INPUTFORMAT

Class name of InputFormat.

By default the InputFormat class is derived from the source DataStore/Technology (DelimitedTextInputFormat or HiveToAvroInputFormat). This option allows the user to specify the class name of a custom InputFormat.

Default: <empty>.

Cannot be used with OLH_OUTPUT_MODE=OSCH.

For example, for reading custom file formats like web log files the OLH RegexInputFormat can be used by assigning the value: oracle.hadoop.loader.lib.input.RegexInputFormat

See KM option EXTRA_OLH_CONF_PROPERTIES for details on how to specify the regular expression.

EXTRA_OLH_CONF_PROPERTIES

Optional extra OLH properties.

Allows adding extra parameters to OLH. E.g. for changing the default OLH date format:

<property>

<name>oracle.hadoop.loader.defaultDateFormat</name>

<value>yyyy-MM-dd HH:mm:ss</value>

</property>

Particularly when using custom InputFormats (see KM option OVERRIDE_INPUTFORMAT for details) the InputFormat may require additional configuration parameters. These are provided in the OLH configuration file. This KM option allows adding extra properties to the OLH configuration file. Default: <empty>

Cannot be used with OLH_OUTPUT_MODE=OSCH

Example (loading apache weblog file format):

When OLH RegexInputFormat is used for reading custom file formats, this KM option specifies the regular expression and other parsing details:

<property>

<name>oracle.hadoop.loader.input.regexPattern</name>

<value>([^ ]*) ([^ ]*) ([^ ]*) (-|\[[^\]]*\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*) (\".*?\") (\".*?\") (\".*?\")</value>

<description>RegEx for Apache WebLog format</description>

</property>

MAPRED_OUTPUT_BASE_DIR

MapReduce Output Directory.

This option specifies an HDFS directory where SQOOP creates subdirectories for temporary files. A subdirectory named after the work table is created here to hold the temporary data.

TEMP_DIR

Local directory for temporary files.

Directory used for storing temporary files like the sqoop script, stdout and stderr redirects.

Leave blank to use the system's default temp directory (<?=System.getProperty("java.io.tmpdir")?>).


A.20 LKM Hive to Oracle OLH-OSCH Direct

This KM integrates data from a Hive query into an Oracle target using Oracle Loader for Hadoop (OLH) and/or Oracle SQL Connector for Hadoop (OSCH).

This is a direct load LKM and must be used without any IKM.

The following table describes the options for LKM Hive to Oracle OLH-OSCH Direct.


Table A-19 LKM Hive to Oracle OLH-OSCH Direct

Option Description

CREATE_TARG_TABLE

Create target table.

Check this option if you wish to create the target table.

TRUNCATE

Replace all target table data.

Set this option to true, if you wish to replace the target table content with the new data.

DELETE_ALL

Delete all rows.

Set this option to true, if you wish to delete all rows from the target table before loading the new data.

USE_HIVE_STAGING_TABLE

Use intermediate Hive staging table?

By default the Hive source data is materialized in a Hive staging table prior to extraction by OLH. If USE_HIVE_STAGING_TABLE is set to false, OLH accesses the Hive source data directly.

USE_HIVE_STAGING_TABLE=0 is only possible if all of these conditions are true:

  • Only a single source table

  • No transformations, filters, joins.

  • No datasets

USE_HIVE_STAGING_TABLE=0 provides better performance by avoiding an extra data transfer step.

DELETE_TEMPORARY_OBJECTS

Delete temporary objects at end of mapping.

Set this option to NO, if you wish to retain temporary objects (tables, files and scripts) after integration. Useful for debugging.

OLH_OUTPUT_MODE

How to transfer data into Oracle?

This option specifies how to load the Hadoop data into Oracle. Permitted values are JDBC, OCI, DP_COPY|DP_OSCH, and OSCH.

  • JDBC output mode: The data is inserted using a number of direct insert JDBC connections.

    In very rare cases JDBC mode may result in duplicate records in target table due to Hadoop trying to restart tasks.

  • OCI output mode: The data is inserted using a number of direct insert OCI connections in direct path mode.

    For direct loading (no C$ table), the target table must be partitioned. For standard loading, FLOW_TABLE_OPTIONS must explicitly specify partitioning, e.g. PARTITION BY HASH(COL1) PARTITIONS 4.

    In very rare cases OCI mode may result in duplicate records in target table due to Hadoop trying to restart tasks.

  • DP_COPY output mode: OLH creates a number of DataPump export files. These files are transferred by a "hadoop fs -copyToLocal" command to the local path specified by EXT_TAB_DIR_LOCATION. Please note that the path must be accessible by the Oracle Database engine. Once the copy job is complete, the files are loaded into the target using an external table.

REJECT_LIMIT

Max number of errors for OLH/EXTTAB.

Enter the maximum number of errors allowed in the file. Examples: UNLIMITED to accept all errors, or an integer value (for example, 10 to allow 10 rejections).

This value is used in OLH job definitions as well as in external table definitions.

EXT_TAB_DIR_LOCATION

Directory for ext tab data files.

File system path of the external table.

Note:

  • Only applicable, if OLH_OUTPUT_MODE = DP_* or OSCH

  • For OLH_OUTPUT_MODE = DP_*: this path must be accessible both from the ODI agent and from the target database engine.

  • For OLH_OUTPUT_MODE = DP_*: the name of the external directory object is the I$ table name.

  • For OLH_OUTPUT_MODE = DP_COPY: ODI agent will use hadoop-fs command to copy dp files into this directory.

  • For OLH_OUTPUT_MODE = DP_*|OSCH: this path will contain any external table log/bad/dsc files.

  • ODI agent will remove any files from this directory during clean up before launching OLH/OSCH.

WORK_TABLE_OPTIONS

Option for Flow table creation.

Use this option to specify the attributes for the integration table at create time; these attributes can be used to increase performance.

This option is set by default to NOLOGGING.

This option may be left empty.

OVERRIDE_INPUTFORMAT

Class name of InputFormat.

By default the InputFormat class is derived from the source DataStore/Technology (DelimitedTextInputFormat or HiveToAvroInputFormat). This option allows the user to specify the class name of a custom InputFormat.

Default: <empty>.

Cannot be used with OLH_OUTPUT_MODE=OSCH.

For example, for reading custom file formats like web log files the OLH RegexInputFormat can be used by assigning the value: oracle.hadoop.loader.lib.input.RegexInputFormat

See KM option EXTRA_OLH_CONF_PROPERTIES for details on how to specify the regular expression.

EXTRA_OLH_CONF_PROPERTIES

Optional extra OLH properties.

Allows adding extra parameters to OLH. E.g. for changing the default OLH date format:

<property>

<name>oracle.hadoop.loader.defaultDateFormat</name>

<value>yyyy-MM-dd HH:mm:ss</value>

</property>

Particularly when using custom InputFormats (see KM option OVERRIDE_INPUTFORMAT for details) the InputFormat may require additional configuration parameters. These are provided in the OLH configuration file. This KM option allows adding extra properties to the OLH configuration file. Default: <empty>

Cannot be used with OLH_OUTPUT_MODE=OSCH

Example (loading apache weblog file format):

When OLH RegexInputFormat is used for reading custom file formats, this KM option specifies the regular expression and other parsing details:

<property>

<name>oracle.hadoop.loader.input.regexPattern</name>

<value>([^ ]*) ([^ ]*) ([^ ]*) (-|\[[^\]]*\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*) (\".*?\") (\".*?\") (\".*?\")</value>

<description>RegEx for Apache WebLog format</description>

</property>

MAPRED_OUTPUT_BASE_DIR

MapReduce Output Directory.

This option specifies an HDFS directory where SQOOP creates subdirectories for temporary files. A subdirectory named after the work table is created here to hold the temporary data.

TEMP_DIR

Local directory for temporary files.

Directory used for storing temporary files like the sqoop script, stdout and stderr redirects.

Leave blank to use the system's default temp directory (<?=System.getProperty("java.io.tmpdir")?>).


A.21 RKM Hive

RKM Hive reverses these metadata elements:

  • Hive tables and views as data stores.

    Specify the reverse mask in the Mask field, and then select the tables and views to reverse. The Mask field in the Reverse Engineer tab filters reverse-engineered objects based on their names. The Mask field cannot be empty and must contain at least the percent sign (%).

  • Hive columns as attributes with their data types.

  • Information about buckets, partitioning, clusters, and sort columns is set in the respective flex fields in the data store or column metadata.

A.22 RKM HBase

RKM HBase reverses these metadata elements:

  • HBase tables as data stores.

    Specify the reverse mask in the Mask field, and then select the tables to reverse. The Mask field in the Reverse Engineer tab filters reverse-engineered objects based on their names. The Mask field cannot be empty and must contain at least the percent sign (%).

  • HBase columns as attributes with their data types.

  • HBase unique row key as attribute called key.

Note:

This RKM uses the oracle.odi.km logger for logging. You can enable logging by changing log level for oracle.odi.km logger to TRACE:16 in ODI-logging-config.xml as shown below:

<logger name="oracle.odi.km" level="TRACE:16" useParentHandlers="true"/>
<logger name="oracle.odi.studio.message.logger.proxy" level="TRACE:16" useParentHandlers="false"/>

For more information about logging configuration in ODI, please see Runtime Logging for ODI components in Administering Oracle Data Integrator.

The following table describes the options for RKM HBase.


Table A-20 RKM HBase Options

Option Description

SCAN_MAX_ROWS

Specifies the maximum number of rows to be scanned during reversing of a table. The default value is 10000.

SCAN_START_ROW

Specifies the key of the row to start the scan on. By default the scan will start on the first row. The row key is specified as a Java expression returning an instance of org.apache.hadoop.hbase.util.Bytes. Example: Bytes.toBytes("EMP000001").

SCAN_STOP_ROW

Specifies the key of the row to stop the scan on. By default the scan will run to the last row of the table or until SCAN_MAX_ROWS is reached. The row key is specified as a Java expression returning an instance of org.apache.hadoop.hbase.util.Bytes. Example: Bytes.toBytes("EMP000999").

Only applies if SCAN_START_ROW is specified.

SCAN_ONLY_FAMILY

Restricts the scan to column families whose names match this pattern. The SQL LIKE wildcards percent (%) and underscore (_) can be used. By default all column families are scanned.


A.23 IKM File to Hive (Deprecated)

Note: This KM is deprecated and only used for backward compatibility.

IKM File to Hive (Load Data) supports:

  • One or more input files. To load multiple source files, enter an asterisk or a question mark as a wildcard character in the resource name of the file DataStore (for example, webshop_*.log).

  • File formats:

    • Fixed length

    • Delimited

    • Customized format

  • Loading options:

    • Immediate or deferred loading

    • Overwrite or append

    • Hive external tables

The following table describes the options for IKM File to Hive (Load Data). See the knowledge module for additional details.


Table A-21 IKM File to Hive Options

Option Description

CREATE_TARG_TABLE

Check this option, if you wish to create the target table. If USE_STAGING_TABLE is set to false, note that data will only be read correctly if the target table definition, particularly the row format and file format details, is correct.

TRUNCATE

Set this option to true, if you wish to replace the target table/partition content with the new data. Otherwise the new data will be appended to the target table. If TRUNCATE and USE_STAGING_TABLE are set to false, all source file names must be unique and must not collide with any data files already loaded into the target table.

FILE_IS_LOCAL

Defines whether the source file is to be considered local (outside of the current Hadoop cluster). If this option is set to true, the data file(s) are copied into the Hadoop cluster first. The file has to be accessible by the Hive server through the local or shared file system. If this option is set to false, the data file(s) are moved into the Hadoop cluster and therefore will no longer be available at their source location. If the source file is already in HDFS, setting this option to false results in just a file rename, and the operation is therefore very fast.

This option only applies, if EXTERNAL_TABLE is set to false.

EXTERNAL_TABLE

Defines whether to declare the target/staging table as externally managed. For non-external tables Hive manages all data files. That is, it will move any data files into <hive.metastore.warehouse.dir>/<table_name>. For external tables Hive does not move or delete any files. It will load data from the location given by the ODI schema.

If this option is set to true:

  • All files in the directory given by the physical data schema will be loaded. So any filename or wildcard information from the source DataStore's resource name will be ignored.

  • The directory structure and file names must comply with Hive's directory organization for tables, for example, for partitioning and clustering.

  • The directory and its files must reside in HDFS.

  • No Hive LOAD-DATA-statements are submitted and thus loading of files to a specific partition (using a target-side expression) is not possible.

USE_STAGING_TABLE

Defines whether an intermediate staging table will be created.

A Hive staging table is required if:

  • Target table is partitioned, but data spreads across partitions

  • Target table is clustered

  • Target table (partition) is sorted, but input file is not

  • Target table is already defined and target table definition does not match the definition required by the KM

  • Target column order does not match source file column order

  • There are any unmapped source columns

  • There are any unmapped non-partition target columns

  • The source is a fixed length file and the target has non-string columns

In case none of the above is true, this option can be turned off for better performance.

DELETE_TEMPORARY_OBJECTS

Removes temporary objects, such as tables, files, and scripts after integration. Set this option to No if you want to retain the temporary files, which might be useful for debugging.

DEFER_TARGET_LOAD

Defines whether the file(s), which have been declared to the staging table, should be loaded into the target table now or during a later execution. Permitted values are START, NEXT, END, or <empty>.

This option only applies if USE_STAGING_TABLE is set to true.

The typical use case for this option is when there are multiple files, each of them requires data redistribution or sorting, and the files are gathered by calling the interface several times. For example, the interface is used in a package that retrieves (many small) files from different locations, and the location, stored in a variable, is to be used in a target partition column. In this case the first interface execution has DEFER_TARGET_LOAD set to START, the next interface executions have DEFER_TARGET_LOAD set to NEXT, and the last interface has it set to END, as illustrated below. The interfaces having DEFER_TARGET_LOAD set to START/NEXT just load the data file into HDFS (but not yet into the target table) and can be executed in parallel to accelerate file upload to the cluster.
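
For example, a hypothetical three-file package would run the same mapping three times with these settings:

Execution 1 (file from location A): DEFER_TARGET_LOAD = START
Execution 2 (file from location B): DEFER_TARGET_LOAD = NEXT
Execution 3 (file from location C): DEFER_TARGET_LOAD = END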

OVERRIDE_ROW_FORMAT

Allows overriding the entire Hive row format definition of the staging table (if USE_STAGING_TABLE is set to true) or the target table (if USE_STAGING_TABLE is set to false). It contains the text to be used for the row format definition. Example for reading Apache Combined WebLog files:

ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*) (\".*?\") (\".*?\") (\".*?\")",
  "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s %10$s"
)
STORED AS TEXTFILE

The list of columns in the source DataStore must match the list of input groups in the regular expression (same number of columns and appropriate data types). If USE_STAGING_TABLE is set to false, the number of target columns must match the number of columns returned by the SerDe (in the above example, the number of groups in the regular expression); the number of source columns is ignored, but at least one column must be mapped to the target. All source data is mapped into the target table structure according to the column order: the SerDe's first column is mapped to the first target column, the SerDe's second column to the second target column, and so on. If USE_STAGING_TABLE is set to true, the source DataStore must have as many columns as the SerDe returns. Only data of mapped columns will be transferred.

STOP_ON_FILE_NOT_FOUND

Defines whether the KM should stop if the input file is not found.

HIVE_COMPATIBILE

Specifies the Hive version compatibility. The values permitted for this option are 0.7 and 0.8.

  • 0.7: Simulates the append behavior. Must be used for Hive 0.7 (CDH3).

  • 0.8: Uses Hive's append feature, which provides better performance. Requires Hive 0.8 (CDH4) or later.


A.24 LKM HBase to Hive (HBase-SerDe) [Deprecated]

Note: This KM is deprecated and only used for backward compatibility.

LKM HBase to Hive (HBase-SerDe) supports:

  • A single source HBase table.

The following table describes the options for LKM HBase to Hive (HBase-SerDe). See the knowledge module for additional details.


Table A-22 LKM HBase to Hive (HBase-SerDe) Options

Option Description

DELETE_TEMPORARY_OBJECTS

Deletes temporary objects such as tables, files, and scripts post data integration. Set this option to NO if you want to retain the temporary objects, which might be useful for debugging.


A.25 IKM Hive to HBase Incremental Update (HBase-SerDe) [Deprecated]

Note: This KM is deprecated and only used for backward compatibility.

IKM Hive to HBase Incremental Update (HBase-SerDe) supports:

  • Filters, Joins, Datasets, Transformations and Aggregations in Hive

  • Inline views generated by IKM Hive Transform

  • Inline views generated by IKM Hive Control Append

The following table describes the options for IKM Hive to HBase Incremental Update (HBase-SerDe). See the knowledge module for additional details.


Table A-23 IKM Hive to HBase Incremental Update (HBase-SerDe) Options

Option Description

CREATE_TARG_TABLE

Creates the HBase target table.

TRUNCATE

Replaces the target table content with the new data. If this option is set to false, the new data is appended to the target table.

DELETE_TEMPORARY_OBJECTS

Deletes temporary objects such as tables, files, and scripts post data integration. Set this option to NO if you want to retain the temporary objects, which might be useful for debugging.

HBASE_WAL

Enables or disables the Write-Ahead-Log (WAL) that HBase uses to protect against data loss. For better performance, WAL can be disabled.


A.26 IKM SQL to Hive-HBase-File (SQOOP) [Deprecated]

Note: This KM is deprecated and only used for backward compatibility.

IKM SQL to Hive-HBase-File (SQOOP) supports:

  • Mappings on staging

  • Joins on staging

  • Filter expressions on staging

  • Datasets

  • Lookups

  • Derived tables

The following table describes the options for IKM SQL to Hive-HBase-File (SQOOP). See the knowledge module for additional details.


Table A-24 IKM SQL to Hive-HBase-File (SQOOP) Options

Option Description

CREATE_TARG_TABLE

Creates the target table. This option is applicable only if the target is Hive or HBase.

TRUNCATE

Replaces any existing target table content with the new data. For Hive and HBase targets, the target data is truncated. For File targets, the target directory is removed, and this option must be set to true.

SQOOP_PARALLELISM

Specifies the degree of parallelism. More precisely the number of mapper processes used for extraction.

If SQOOP_PARALLELISM option is set to greater than 1, SPLIT_BY option must be defined.

SPLIT_BY

Specifies the target column to be used for splitting the source data into n chunks for parallel extraction, where n is SQOOP_PARALLELISM. To achieve equally sized data chunks the split column should contain homogeneously distributed values. For calculating the data chunk boundaries a query similar to SELECT MIN(EMP.EMPNO), MAX(EMP.EMPNO) from EMPLOYEE EMP is used. To avoid an extra full table scan the split column should be backed by an index.

BOUNDARY_QUERY

For splitting the source data into chunks for parallel extraction, the minimum and maximum values of the split column are retrieved (KM option SPLIT_BY). In certain situations these may not be the best boundaries or not the most efficient way to retrieve the boundaries. In such cases this KM option can be set to a SQL query returning one row with two columns: the lowest and the highest value to be used for the split column. This range will be divided into SQOOP_PARALLELISM chunks for parallel extraction. Example of hard-coded ranges for an Oracle source:

SELECT 1000, 2000 FROM DUAL

For preserving context independence, regular table names should be inserted through odiRef.getObjectName calls. For example:

SELECT MIN(EMPNO), MAX(EMPNO) FROM <%=odiRef.getObjectName("EMP")%>

TEMP_DIR

Specifies the directory used for storing temporary files, such as sqoop script, stdout and stderr redirects. Leave this option blank to use system's default temp directory:

<?=System.getProperty("java.io.tmp")?>

MAPRED_OUTPUT_BASE_DIR

Specifies an HDFS directory where SQOOP creates subdirectories for temporary files. A subdirectory named after the work table is created here to hold the temporary data.

DELETE_TEMPORARY_OBJECTS

Deletes temporary objects such as tables, files, and scripts after data integration. Set this option to NO if you want to retain the temporary objects, which might be useful for debugging.

USE_HIVE_STAGING_TABLE

Loads data into the Hive work table before loading into the Hive target table. Set this option to false to load data directly into the target table.

Setting this option to false is only possible, if all these conditions are true:

  • All target columns are mapped

  • The existing Hive table uses the standard Hive row separator (\n) and column delimiter (\01)

Setting this option to false provides better performance by avoiding an extra data transfer step.

This option is applicable only if the target technology is Hive.

USE_GENERIC_JDBC_CONNECTOR

Specifies whether to use the generic JDBC connector if a connector for the target technology is not available.

For certain technologies SQOOP provides specific connectors. These connectors take care of SQL-dialects and optimize performance. When there is a connector for the respective target technology, this connector should be used. If not, the generic JDBC connector can be used.

EXTRA_HADOOP_CONF_PROPERTIES

Optional generic Hadoop properties.

EXTRA_SQOOP_CONF_PROPERTIES

Optional SQOOP properties.

EXTRA_SQOOP_CONNECTOR_CONF_PROPERTIES

Optional SQOOP connector properties.


A.27 IKM Hive Control Append (Deprecated)

Note: This KM is deprecated and only used for backward compatibility.

This knowledge module validates and controls the data, and integrates it into a Hive target table in truncate/insert (append) mode. Invalid data is isolated in an error table and can be recycled. IKM Hive Control Append supports inline view mappings that use either this knowledge module or IKM Hive Transform.

The following table describes the options for IKM Hive Control Append.


Table A-25 IKM Hive Control Append Options

Option Description

FLOW_CONTROL

Activates flow control.

RECYCLE_ERRORS

Recycles data rejected from a previous control.

STATIC_CONTROL

Controls the target table after having inserted or updated target data.

CREATE_TARG_TABLE

Creates the target table.

TRUNCATE

Replaces the target table content with the new data. Setting this option to true provides better performance.

DELETE_TEMPORARY_OBJECTS

Removes the temporary objects, such as tables, files, and scripts after data integration. Set this option to NO if you want to retain the temporary objects, which might be useful for debugging.

HIVE_COMPATIBILE

Specifies the Hive version compatibility. The values permitted for this option are 0.7 and 0.8.

  • 0.7: Simulates the append behavior. Must be used for Hive 0.7 (CDH3).

  • 0.8: Uses Hive's append feature, which provides better performance. Requires Hive 0.8 (CDH4) or later.


A.28 CKM Hive (Deprecated)

Note: This KM is deprecated and only used for backward compatibility.

This knowledge module checks data integrity for Hive tables. It verifies the validity of the constraints of a Hive data store and diverts the invalid records to an error table. You can use CKM Hive for static control and flow control. You must also define these constraints on the stored data.

The following table describes the options for this check knowledge module.


Table A-26 CKM Hive Options

Option Description

DROP_ERROR_TABLE

Drops error table before execution. When this option is set to YES, the error table will be dropped each time a control is performed on the target table. This means that any rejected records, identified and stored during previous control operations, will be lost. Otherwise previous rejects will be preserved. In addition to the error table, any table called <error table>_tmp will also be dropped.

HIVE_COMPATIBILE

Specifies the Hive version compatibility. The values permitted for this option are 0.7 and 0.8.

  • 0.7: Simulates the append behavior. Must be used for Hive 0.7 (CDH3).

  • 0.8: Uses Hive's append feature, which provides better performance. Requires Hive 0.8 (CDH4) or later.


A.29 IKM Hive Transform (Deprecated)

Note: This KM is deprecated and only used for backward compatibility.

This knowledge module performs transformations. It uses a shell script to transform the data, and then integrates it into a Hive target table using replace mode. The knowledge module supports inline view mappings and can be used as an inline-view for IKM Hive Control Append.

The transformation script must read the input columns in the order defined by the source data store. Only mapped source columns are streamed into the transformations. The transformation script must provide the output columns in the order defined by the target data store.

The following table describes the options for this integration knowledge module.


Table A-27 IKM Hive Transform Options

Option Description

CREATE_TARG_TABLE

Creates the target table.

DELETE_TEMPORARY_OBJECTS

Removes the temporary objects, such as tables, files, and scripts post data integration. Set this option to NO if you want to retain the temporary objects, which might be useful for debugging.

TRANSFORM_SCRIPT_NAME

Defines the file name of the transformation script. This transformation script is used to transform the input data into the output structure. Both local and HDFS paths are supported, for example:

Local script location: file:///tmp/odi/script1.pl

HDFS script location: hdfs://namenode:nnPort/tmp/odi/script1.pl

Ensure that the following requirements are met:

  • The path/file must be accessible by both the ODI agent and the Hive server. Read access for the Hive server is required because it is the Hive server that executes the resulting MapReduce job that invokes the script.

  • If TRANSFORM_SCRIPT is set (ODI creates the script file during mapping execution), the path/file must be writable by the ODI agent, as it is the ODI agent that writes the script file using the HDFS Java API.

When the KM option TRANSFORM_SCRIPT is set, the following paragraphs provide some configuration help:

  • For HDFS script locations:

    The script file created is owned by the ODI agent user and receives the group of the owning directory. See the Hadoop HDFS Permissions Guide for more details. The standard configuration to cover the above two requirements for HDFS scripts is to ensure that the group of the HDFS script directory includes the ODI agent user (let's assume oracle) as well as the Hive server user (let's assume hive). Assuming that the group hadoop includes oracle and hive, the sample command below adjusts the ownership of the HDFS script directory:

    Log on as the hdfs user, then run:

    hdfs dfs -chown oracle:hadoop /tmp/odi/myscriptdir

  • For local script locations:

    The script file created is owned by the ODI agent user and receives the ODI agent user's default group, unless the SGID bit has been set on the script directory. If the SGID bit has been set, the file will be owned by the group of the script directory instead. The standard configuration to cover the above two requirements for local scripts is similar to the HDFS configuration and uses the SGID bit:

    chown oracle:hadoop /tmp/odi/myscriptdir
    chmod g+s /tmp/odi/myscriptdir

TRANSFORM_SCRIPT

Defines the transformation script content. This transformation script is then used to transform the input data into the output structure. If left blank, the file given in TRANSFORM_SCRIPT_NAME must already exist. If not blank, the script file is created.

Script example (1-to-1 transformation):

#! /usr/bin/csh -f
cat

All mapped source columns are spooled as tab-separated data into this script via stdin. This UNIX script then transforms the data and writes it out as tab-separated data on stdout. The script must provide as many output columns as there are target columns. A slightly richer example sketch is shown below.
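
As a sketch only, and assuming the second mapped source column is a string that should be upper-cased (the column layout is hypothetical), a transformation script could look like this:

#! /bin/sh
# Reads tab-separated rows from stdin (assumes at least two columns per row),
# upper-cases the second column, and writes tab-separated rows to stdout.
# The number of output columns must still match the number of target columns.
awk 'BEGIN { FS = OFS = "\t" } { $2 = toupper($2); print }'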

TRANSFORM_SCRIPT_MODE

Unix/HDFS file permissions for script file in octal notation with leading zero. For example, full permissions for owner and group: 0770.

Warning: Using wider permissions like 0777 poses a security risk.

See also KM option description for TRANSFORM_SCRIPT_NAME for details on directory permissions.

PRE_TRANSFORM_DISTRIBUTE

Provides an optional, comma-separated list of source column names, which enables the knowledge module to distribute the data before the transformation script is applied. See the combined example following POST_TRANSFORM_SORT.

PRE_TRANSFORM_SORT

Provides an optional, comma-separated list of source column names, which enables the knowledge module to sort the data before the transformation script is applied.

POST_TRANSFORM_DISTRIBUTE

Provides an optional, comma-separated list of target column names, which enables the knowledge module to distribute the data after the transformation script is applied.

POST_TRANSFORM_SORT

Provides an optional, comma-separated list of target column names, which enables the knowledge module to sort the data after the transformation script is applied.
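
For example, with hypothetical source columns DEPTNO and HIREDATE and hypothetical target columns DEPARTMENT and HIRE_YEAR, the four distribute/sort options could be set as follows:

PRE_TRANSFORM_DISTRIBUTE: DEPTNO
PRE_TRANSFORM_SORT: DEPTNO, HIREDATE
POST_TRANSFORM_DISTRIBUTE: DEPARTMENT
POST_TRANSFORM_SORT: DEPARTMENT, HIRE_YEAR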


A.30 IKM File-Hive to Oracle (OLH-OSCH) [Deprecated]

Note: This KM is deprecated and only used for backward compatibility.

IKM File-Hive to Oracle (OLH-OSCH) integrates data from an HDFS file or Hive source into an Oracle database target using Oracle Loader for Hadoop (OLH) and/or Oracle SQL Connector for HDFS (OSCH). Using the mapping configuration and the selected options, the knowledge module generates an appropriate Oracle Database target instance. Hive and Hadoop versions must follow the requirements.


The following table describes the options for this integration knowledge module.


Table A-28 IKM File - Hive to Oracle (OLH-OSCH)

Option Description

OLH_OUTPUT_MODE

Specifies how to load the Hadoop data into Oracle. Permitted values are JDBC, OCI, DP_COPY, DP_OSCH, and OSCH.

  • JDBC output mode: The data is inserted using a number of direct insert JDBC connections. In very rare cases JDBC mode may result in duplicate records in target table due to Hadoop trying to restart tasks.

  • OCI output mode: The data is inserted using a number of direct insert OCI connections in direct path mode. If USE_ORACLE_STAGING_TABLE is set to false, the target table must be partitioned. If USE_ORACLE_STAGING_TABLE is set to true, FLOW_TABLE_OPTIONS must explicitly specify partitioning, for example, "PARTITION BY HASH(COL1) PARTITIONS 4". In very rare cases OCI mode may result in duplicate records in the target table due to Hadoop trying to restart tasks.

  • DP_COPY output mode: OLH creates a number of DataPump export files. These files are transferred by a "Hadoop fs -copyToLocal" command to the local path specified by EXT_TAB_DIR_LOCATION. Please note that the path must be accessible by the Oracle Database engine. Once the copy job is complete, an external table is defined in the target database, which accesses the files from EXT_TAB_DIR_LOCATION.

  • DP_OSCH output mode: OLH creates a number of DataPump export files. After the export phase an external table is created on the target database, which accesses these output files directly via OSCH. Please note that the path must be accessible by the Oracle Database engine. Once the copy job is complete, an external table is defined in the target database, which accesses the files from EXT_TAB_DIR_LOCATION.

  • OSCH output mode: In OSCH mode loading, OLH is bypassed. ODI creates an external table on the target database, which accesses the input files through OSCH. Please note that only delimited and fixed length files can be read. No support for loading from Hive or custom Input Formats such as RegexInputFormat, as there is no OLH pre-processing.

REJECT_LIMIT

Specifies the maximum number of errors for OLH and external tables. Examples: UNLIMITED to accept all errors; an integer value (for example, 10 to allow 10 rejections). This value is used in job definitions as well as in external table definitions.

CREATE_TARG_TABLE

Creates the target table.

TRUNCATE

Replaces the target table content with the new data.

DELETE_ALL

Deletes all the data in the target table.

USE_HIVE_STAGING_TABLE

Materializes Hive source data before extraction by OLH. If this option is set to false, OLH directly accesses the Hive source data. Setting this option to false is only possible if all these conditions are true:

  • Only a single source table

  • No transformations, filters, joins

  • No datasets

Setting this option to false provides better performance by avoiding an extra data transfer step.

This option is applicable only if the source technology is Hive.

USE_ORACLE_STAGING_TABLE

Uses an intermediate Oracle database staging table.

The extracted data is made available to Oracle by an external table. If USE_ORACLE_STAGING_TABLE is set to true (default), the external table is created as a temporary (I$) table. This I$ table data is then inserted into the target table. Setting this option to false is only possible, if all these conditions are true:

  • OLH_OUTPUT_MODE is set to JDBC or OCI

  • All source columns are mapped

  • All target columns are mapped

  • No target-side mapping expressions

Setting this option to false provides better performance by avoiding an extra data transfer step, but may lead to partial data being loaded into the target table, as OLH loads data in multiple transactions.

EXT_TAB_DIR_LOCATION

Specifies the file system path of the external table. Please note the following:

  • Only applicable, if OLH_OUTPUT_MODE = DP_*|OSCH

  • For OLH_OUTPUT_MODE = DP_*: this path must be accessible both from the ODI agent and from the target database engine.

  • For OLH_OUTPUT_MODE = DP_*: the name of the external directory object is the I$ table name.

  • For OLH_OUTPUT_MODE = DP_COPY: the ODI agent will use the hadoop fs command to copy the dp files into this directory.

  • For OLH_OUTPUT_MODE = DP_*|OSCH: this path will contain any external table log/bad/dsc files.

  • ODI agent will remove any files from this directory during clean up before launching OLH/OSCH.

TEMP_DIR

Specifies the directory used for storing temporary files, such as sqoop script, stdout and stderr redirects. Leave this option blank to use system's default temp directory:

<?=System.getProperty("java.io.tmp")?>

MAPRED_OUTPUT_BASE_DIR

Specifies an HDFS directory, where the job will create subdirectories for temporary files/datapump output files.

FLOW_TABLE_OPTIONS

Specifies the attributes for the integration table at create time; used for increasing performance. This option is set by default to NOLOGGING. This option may be left empty.

DELETE_TEMPORARY_OBJECTS

Removes temporary objects, such as tables, files, and scripts post data integration. Set this option to NO if you want to retain the temporary objects, which might be useful for debugging.

OVERRIDE_INPUTFORMAT

By default the InputFormat class is derived from the source DataStore/Technology (DelimitedTextInputFormat or HiveToAvroInputFormat). This option allows the user to specify the class name of a custom InputFormat. Cannot be used with OLH_OUTPUT_MODE=OSCH.

For example, for reading custom file formats like web log files, the OLH RegexInputFormat can be used by assigning the value: oracle.hadoop.loader.lib.input.RegexInputFormat

See KM option EXTRA_OLH_CONF_PROPERTIES for details on how to specify the regular expression.

EXTRA_OLH_CONF_PROPERTIES

Particularly when using custom InputFormats (see KM option OVERRIDE_INPUTFORMAT for details) the InputFormat may require additional configuration parameters. These are provided in the OLH configuration file. This KM option allows adding extra properties to the OLH configuration file. Cannot be used with OLH_OUTPUT_MODE=OSCH.

Example (loading the Apache web log file format): When the OLH RegexInputFormat is used for reading custom file formats, this KM option specifies the regular expression and other parsing details:

<property>
  <name>oracle.hadoop.loader.input.regexPattern</name>
  <value>([^ ]*) ([^ ]*) ([^ ]*) (-|\[[^\]]*\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*) (\".*?\") (\".*?\") (\".*?\")</value>
  <description>RegEx for Apache WebLog format</description>
</property>


A.31 IKM File-Hive to SQL (SQOOP) [Deprecated]

Note: This KM is deprecated and only used for backward compatibility.

IKM File-Hive to SQL (SQOOP) supports:

  • Filters, Joins, Datasets, Transformations and Aggregations in Hive

  • Inline views generated by IKM Hive Control Append

  • Inline views generated by IKM Hive Transform

  • Hive-HBase source tables using LKM HBase to Hive (HBase SerDe)

  • File source data (delimited file format only)

The following table describes the options for this integration knowledge module.


Table A-29 IKM File-Hive to SQL (SQOOP)

Option Description

CREATE_TARG_TABLE

Creates the target table.

TRUNCATE

Replaces the target datastore content with new data. If this option is set to false, the new data is appended to the target datastore.

DELETE_ALL

Deletes all the rows in the target datastore.

SQOOP_PARALLELISM

Specifies the degree of parallelism. More precisely the number of mappers used during SQOOP export and therefore the number of parallel JDBC connections.

USE_TARGET_STAGING_TABLE

By default the source data is staged into a target-side staging table, before it is moved into the target table. If this option is set to false, SQOOP loads the source data directly into the target table, which provides better performance and less need for tablespace in target RDBMS by avoiding an extra data transfer step.

For File sources setting this option to false is only possible, if all these conditions are met:

  • All source columns must be mapped

  • Source and target columns have the same order

  • The first file column must map to the first target column

  • No mapping gaps

  • Only 1-to-1 mappings (no expressions)

Please note the following:

  • SQOOP uses multiple writers, each having their own JDBC connection to the target. Every writer uses multiple transactions for inserting the data. This means that in case USE_TARGET_STAGING_TABLE is set to false, changes to the target table are no longer atomic and writer failures can lead to partially updated target tables.

  • The Teradata Connector for SQOOP always creates an extra staging table during load. This connector staging table is independent of the KM option.

USE_GENERIC_JDBC_CONNECTOR

Specifies whether to use the generic JDBC connector if a connector for the target technology is not available.

For certain technologies SQOOP provides specific connectors. These connectors take care of SQL-dialects and optimize performance. When there is a connector for the respective target technology, this connector should be used. If not, the generic JDBC connector can be used.

FLOW_TABLE_OPTIONS

When creating the target-side work table, RDBMS-specific table options can improve performance. By default this option is empty and the knowledge module will use the following table options:

  • For Oracle: NOLOGGING

  • For DB2: NOT LOGGED INITIALLY

  • For Teradata: no fallback, no before journal, no after journal

Any explicit value overrides these defaults.
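
For example, an explicit value for an Oracle target (an illustrative setting, not a recommendation) could be:

NOLOGGING PARALLEL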

TEMP_DIR

Specifies the directory used for storing temporary files, such as sqoop script, stdout and stderr redirects. Leave this option blank to use system's default temp directory:

<?=System.getProperty("java.io.tmp")?>

MAPRED_OUTPUT_BASE_DIR

Specifies an HDFS directory where SQOOP creates subdirectories for temporary files. A subdirectory named after the work table is created here to hold the temporary data.

DELETE_TEMPORARY_OBJECTS

Deletes temporary objects such as tables, files, and scripts after data integration. Set this option to NO if you want to retain the temporary objects, which might be useful for debugging.

TERADATA_PRIMARY_INDEX

Primary index for the target table. Teradata uses the primary index to spread data across AMPs. It is important that the chosen primary index has a high cardinality (many distinct values) to ensure evenly spread data to allow maximum processing performance. Please follow Teradata's recommendation on choosing a primary index.

This option is applicable only to Teradata targets.
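
For example, assuming a hypothetical target table with a high-cardinality column CUSTOMER_ID, this option could be set to:

CUSTOMER_ID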

TERADATA_FLOW_TABLE_TYPE

Type of the Teradata flow table, either SET or MULTISET.

This option is applicable only to Teradata targets.

TERADATA_OUTPUT_METHOD

Specifies the way the Teradata Connector will load the data. Valid values are:

  • batch.insert: multiple JDBC connections using batched prepared statements (simplest to start with)

  • multiple.fastload: multiple FastLoad connections

  • internal.fastload: single coordinated FastLoad connections (most performant)

This option is applicable only to Teradata targets.

EXTRA_HADOOP_CONF_PROPERTIES

Optional generic Hadoop properties.

EXTRA_SQOOP_CONF_PROPERTIES

Optional SQOOP properties.

EXTRA_SQOOP_CONNECTOR_CONF_PROPERTIES

Optional SQOOP connector properties.