4 Integrating Hadoop Data

This chapter provides information about the steps you need to perform to integrate Hadoop data.

Integrating Hadoop Data

To integrate Hadoop data, set up data sources, import the Hadoop knowledge modules, create Oracle Data Integrator models, and design mappings to load, validate, and transform Hadoop data.

The following table summarizes the steps for integrating Hadoop data.

Table 4-1 Integrating Hadoop Data

Step Description

Set Up Data Sources

Set up the data sources to create the data source models. You must set up File, Hive, HDFS, HBase, Kafka, and Cassandra data sources.

See Setting Up File Data Sources

See Setting Up Hive Data Sources

See Setting Up HBase Data Sources

See Setting Up Kafka Data Sources

See Setting Up Cassandra Data Sources

See Setting Up HDFS Data Sources

Import Hadoop Knowledge Modules

Import the Hadoop KMs into Global Objects or a project.

See Importing Hadoop Knowledge Modules

Create Oracle Data Integrator Models

Reverse-engineer the Hive and HBase models to create Oracle Data Integrator models.

See Creating ODI Models and Data Stores to represent Hive, HBase and Cassandra Tables, and HDFS Files

Configure Hadoop Credential Provider

Configure Hadoop Credential Provider and define the password.

See Password Handling in Hadoop.

Integrate Hadoop Data

Design mappings to load, validate, and transform Hadoop data.

See Loading Data from Files into Hive

See Loading Data from HBase into Hive

See Loading Data from Hive into HBase

See Loading Data from an SQL Database into Hive, HBase, and File using SQOOP

See Validating and Transforming Data Within Hive

See Loading Data into an Oracle Database from Hive and File

See Loading Data into an SQL Database from HBase, Hive, and File using SQOOP

See Loading Data from Kafka to Spark Processing Engine

See Loading Data from HDFS File to Hive

See Loading Data from HDFS File to Spark

See Loading Data from Hive to Files

Setting Up File Data Sources

To set up file data sources, you need to create a data server object under File technology along with a physical and logical schema for every directory to be accessed.

In the Hadoop context, there is a distinction between files in Hadoop Distributed File System (HDFS) and local files (outside of HDFS).

To define a data source:

  1. Create a Data Server object under File technology.

  2. Create a Physical Schema object for every directory to be accessed.

  3. Create a Logical Schema object for every directory to be accessed.

  4. Create a Model for every Logical Schema.

  5. Create one or more data stores for each different type of file and wildcard name pattern.

  6. HDFS files are now created using the HDFS technology, as described in Setting Up HDFS Data Sources. However, for backward compatibility, some Big Data File Knowledge Modules support HDFS files. To define HDFS files, you must select HDFS File and define the Hadoop DataServer reference. Alternatively, you can create a Data Server object under File technology by entering the HDFS name node in the JDBC URL field and leaving the JDBC Driver name empty. For example:

    hdfs://bda1node01.example.com:8020
    

    Test Connection is not supported for this Data Server configuration.

Setting Up HDFS Data Sources

To set up HDFS data sources, you need to create a data server object under HDFS technology along with a physical and logical schema for every directory to be accessed.

This topic provides the steps in Oracle Data Integrator that are required for connecting to an HDFS system.
  1. Create a Data Server object under HDFS technology.

    Note:

    The HDFS data server should reference an existing Hadoop data server.
  2. Create a Physical Schema object for every directory to be accessed.
  3. Create a Logical Schema object for every directory to be accessed.
  4. Create a Model for every Logical Schema.
  5. Create one or more data stores for each different type of file.
    The definition tab has a Resource Name field that enables you to specify which file or files it represents. If wildcards are used, the files must have the same schema and be of the same format (all JSON or all Avro).
  6. Select the appropriate Storage Format and the Schema File (a minimal example schema is shown after this procedure).
    The contents of the schema are displayed.
  7. Select the Attributes tab to either enter the attributes manually or reverse-engineer them from the supplied schema.
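
For reference, a schema file for the Avro or JSON storage formats is a standard Avro schema document. The following is a minimal sketch of such a file; the record and field names (for example, a hypothetical customer.avsc) are illustrative only and not taken from this guide:

    {
      "type": "record",
      "name": "Customer",
      "fields": [
        {"name": "cust_id", "type": "long"},
        {"name": "cust_name", "type": "string"},
        {"name": "state", "type": ["null", "string"], "default": null}
      ]
    }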

Setting Up Hive Data Sources

To set up Hive data sources, you need to create a data server object under Hive technology. Oracle Data Integrator connects to Hive by using JDBC.

The following steps in Oracle Data Integrator are required for connecting to a Hive system.

To set up a Hive data source:

  1. Create a Data Server object under Hive technology.

  2. Set the following locations under JDBC:

    JDBC Driver: weblogic.jdbc.hive.HiveDriver

    JDBC URL: jdbc:weblogic:hive://<host>:<port>[; property=value[;...]]

    For example, jdbc:weblogic:hive://localhost:10000;DatabaseName=default;User=default;Password=default

    Note:

    Usually, the User ID and Password are provided in the respective fields of an ODI Data Server. If a Hive user is defined without a password, you must add password=default to the JDBC URL and leave the password field of the Data Server blank.

  3. Set the following on the Definition tab of the data server:

    Hive Metastore URIs: for example, thrift://BDA:10000

  4. Ensure that the Hive server is up and running.

  5. Test the connection to the Data Server.

  6. Create a Physical Schema. Enter the name of the Hive schema in both schema fields of the Physical Schema definition.

  7. Create a Logical Schema object.

  8. Import RKM Hive into Global Objects or a project.

  9. Create a new model for Hive Technology pointing to the logical schema.

  10. Perform a custom reverse-engineering operation using RKM Hive.

The reverse-engineered Hive tables populate the Attributes and Storage tabs of the data store.

Setting Up HBase Data Sources

To set up HBase data sources, you need to create a data server object under HBase technology along with a physical and logical schema object.

The following steps in Oracle Data Integrator are required for connecting to an HBase system.

To set up an HBase data source:

  1. Create a Data Server object under HBase technology.

    JDBC Driver and URL are not available for data servers of this technology.

  2. Set the following on the Definition tab of the data server:

    HBase Quorum: Quorum of the HBase installation. For example: zkhost1.example.com,zkhost2.example.com,zkhost3.example.com

  3. Ensure that the HBase server is up and running.

    Note:

    You cannot test the connection to the HBase Data Server.

  4. Create a Physical Schema.

  5. Create a Logical Schema object.

  6. Import RKM HBase into Global Objects or a project.

  7. Create a new model for HBase Technology pointing to the logical schema.

  8. Perform a custom reverse-engineering operation using RKM HBase.

    Note:

    Ensure that the HBase tables contain some data before performing reverse-engineering. The reverse-engineering operation does not work if the HBase tables are empty.

At the end of this process, the HBase Data Model contains all the HBase tables with their columns and data types.

Setting Up Kafka Data Sources

To set up Kafka data sources, you need to create a data server object under Kafka technology along with a physical and logical schema object. Create one or more data stores for each topic, and then test the connection to the Data Server.

The following procedure describes how to connect to a Kafka system in Oracle Data Integrator.
  1. Create a Data Server object under Kafka technology.
    For information on creating a Kafka data server, see Kafka Data Server Definition and Kafka Data Server Properties.
  2. Create a Physical Schema object.
  3. Create a Logical Schema object.
  4. Create a Model for every Logical Schema.
  5. Create one or more data stores for each different topic.
    The Resource Name in the Definition tab of the data store indicates the Kafka topic. The Kafka topic name can either be entered manually or selected from the list of available topics in the Kafka cluster. There are two ways to load data from Kafka topics, receiver-based and direct; LKM Kafka to Spark supports both approaches.
  6. Test the connection to the Data Server.
    For information on Kafka Integration, see Hadoop Data Integration with Oracle Data Integrator.
The Kafka data model contains all the Kafka topics with their columns and data types.

Setting Up Cassandra Data Sources

To set up Cassandra data sources, you need to create a data server object under Cassandra technology. Oracle Data Integrator connects to Cassandra by using JDBC.

The following procedure describes how to connect to a Cassandra system in Oracle Data Integrator.
  1. Create a Data Server object under Cassandra technology.
  2. Set the following locations under JDBC:
    Add the Cassandra JDBC Driver to the Driver List.

    JDBC Driver: weblogic.jdbc.cassandra.CassandraDriver

    JDBC URL: jdbc:weblogic:cassandra://<host>:<port>[;property=value[;...]]

    For example, jdbc:weblogic:cassandra://cassandra.example.com:9042;KeyspaceName=mykeyspace

    Note:

    The latest driver uses the binary protocol and hence uses the default port 9042.
  3. Ensure that the Cassandra server is up and running.
  4. Test the connection to the Data Server.
  5. Create a Physical Schema object.
  6. Create a Logical Schema object.
  7. Import RKM Cassandra into Global Objects or a project.
  8. Create a Model for every Logical Schema.
  9. Perform a custom reverse-engineering operation using RKM Cassandra.

Importing Hadoop Knowledge Modules

Unlike the other built-in Big Data Knowledge Modules, the RKMs and CKMs must be imported into your project or as global objects before you use them.

Most of the Big Data Knowledge Modules are built into the product. The exceptions are the RKMs and CKMs, which need to be imported into your project or as global objects before you use them. They are:

  • CKM Hive

  • RKM Hive

  • RKM HBase

  • RKM Cassandra

Creating ODI Models and Data Stores to represent Hive, HBase and Cassandra Tables, and HDFS Files

You must create ODI models to hold the data stores that represent HDFS files or Hive, HBase, and Cassandra tables. The reverse-engineering process creates Hive, HBase, and Cassandra data stores for the corresponding Hive, HBase, and Cassandra tables. You can use these data stores as a source or target in your mappings.

Creating a Model

To create a model that is based on the technology hosting Hive, HBase, Cassandra, or HDFS and on the logical schema created when you configured the Hive, HBase, Cassandra, HDFS or File connection, follow the standard procedure described in Developing Integration Projects with Oracle Data Integrator.

For backward compatibility, the Big Data LKMs that read from files (LKM File to Hive LOAD DATA) also support reading from HDFS; however, the source data store must be from a File model. When reading from HDFS, it is preferable to use KMs such as LKM HDFS to File LOAD DATA. In this case, the source data store must be from an HDFS model.

Reverse-Engineering Hive Tables

RKM Hive is used to reverse-engineer Hive tables and views. To perform a customized reverse-engineering of Hive tables with RKM Hive, follow the usual procedures, as described in Developing Integration Projects with Oracle Data Integrator. This topic details information specific to Hive tables.

The reverse-engineering process creates the data stores for the corresponding Hive table or views. You can use the data stores as either a source or a target in a mapping.

For more information about RKM Hive, see RKM Hive.

Hive data stores contain a Storage tab that allows you to see how data is stored and formatted within Hive. If the Hive table has been reverse-engineered, these fields are automatically populated. If you created this data store from scratch, with the intention of creating the table when running a mapping (using the create target table option), you can choose how the data is formatted by editing these fields.

The target Hive table is created based on the data provided in the Storage and Attribute panels of the Hive data store as shown in Table 4-2 and Table 4-3.

Table 4-2 Hive Data Store Storage Panel Properties

Property Description

Table Type

Select one of the following as the type of Hive table to be created:

  • Managed

  • External

  • <Undefined>

Storage Type

Select one of the following as the type of Data storage:

  • Native

  • Non-Native

  • <Undefined>

Row Format

This property appears when Native is selected as the Storage Type.

Select one of the following as the Row Format:

  • Built-In

  • Delimited

  • SerDe

  • <Undefined>

Record Separator

This property appears when Delimited is selected as the Row Format.

Fill in the following fields:

  • Fields Terminated By

  • Fields Escaped By

  • Collection Items Terminated By

  • Map Keys Terminated By

  • Lines Terminated By

  • File Null Value

SerDe

This property appears when SerDe is selected as the Row Format.

Fill in the SerDe Class field.

Storage Format

This longer Storage Format section appears when Native is selected as the Storage Type.

It contains the following properties:

  • Predefined File Format

  • Input Format (appears when INPUTFORMAT is selected as the Predefined File Format.)

  • Output Format (appears when INPUTFORMAT is selected as the Predefined File Format.)

  • Location (appears when External is selected as the Table Type.)

Select one of the following as the Predefined File Format:

  • INPUTFORMAT

  • SEQUENCEFILE

  • PARQUET

  • TEXTFILE

  • ORC

  • JSON

  • RCFILE

  • AVRO

  • <Undefined>

Storage Handler

This property appears when Non-Native is selected as the Storage Type.

Fill in the Storage Handler Class field.

Storage Format

This shorter Storage Format section appears when Non-Native is selected as the Storage Type.

Fill in the Location field.

Table 4-3 Hive Data Store Attribute Panel Properties

Property Description

Order

Order in which attributes are sequenced.

Name

Name of the attribute.

Type

Data type of the attribute.

Data Format

Data Format of the attribute.

Note:

This field is only used for attributes with a data type of "Complex". The content is populated during reverse-engineering and will contain a definition of the Complex Type.

Length

Physical length of the attribute.

Scale

Scale of the numeric attribute.

Not Null

Specifies if the attribute can be null or not.

SCD Behavior

This is not used for Hive data stores.

Partition By

Select if it is a partition column.

Cluster By

Select if it is a bucketed column.

Sort By

Select to sort data on this column within the bucket.

Note:

You must set the position of this column in the SORTED BY clause. The column with the smaller Sort By value gets the higher priority. For example, consider three columns: C1 with Sort By = 5, C2 with Sort By = 2, and C3 with Sort By = 8. The SORTED BY clause will be SORTED BY (C2, C1, C3).

Sort Direction

Select to sort data in the ascending (ASC) or descending (DESC) order.

The data provided above can also be used to create a Hive DDL when the CREATE_TARG_TABLE option is selected in the LKMs and IKMs.
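
For illustration, the following sketch shows the kind of Hive DDL these settings translate to, assuming a hypothetical external, delimited, partitioned, and bucketed table; the table, column, and location names are examples only. Table Type maps to the EXTERNAL keyword and the Location field, Row Format to the ROW FORMAT clause, Predefined File Format to the STORED AS clause, and the Partition By, Cluster By, and Sort By attribute settings to the PARTITIONED BY and CLUSTERED BY ... SORTED BY clauses:

    CREATE EXTERNAL TABLE customers (
      cust_id   BIGINT,
      cust_name STRING,
      state     STRING
    )
    PARTITIONED BY (load_date STRING)
    CLUSTERED BY (cust_id) SORTED BY (cust_name ASC) INTO 16 BUCKETS
    ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
      LINES TERMINATED BY '\n'
    STORED AS TEXTFILE
    LOCATION '/user/oracle/customers';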

To fully use the Hive format and storage information, one or more of the following KMs must be used:

  • IKM Hive Append

  • IKM Hive Incremental Update

  • LKM File to Hive LOAD DATA Direct

  • LKM HDFS File to Hive LOAD DATA Direct

Reverse-Engineering HBase Tables

RKM HBase is used to reverse-engineer HBase tables. To perform a customized reverse-engineering of HBase tables with RKM HBase, follow the usual procedures, as described in Developing Integration Projects with Oracle Data Integrator. This topic details information specific to HBase tables.

The reverse-engineering process creates the data stores for the corresponding HBase table. You can use the data stores as either a source or a target in a mapping.

Note:

Ensure that the HBase tables contain some data before performing reverse-engineering. The reverse-engineering operation does not work if the HBase tables are empty.

For more information about RKM HBase, see RKM HBase.

Reverse-Engineering HDFS Files

HDFS files are represented using data stores based on HDFS technology. The HDFS data stores contain the storage format (JSON, Delimited, etc.), attributes, datatypes, and datatype properties.

In previous versions of ODI, File technology was used to represent HDFS Files, but the storage format information was specified in the mappings. If you have existing mappings that use Knowledge Modules such as LKM File to Hive or LKM File to Spark, then you should continue to represent your HDFS files with File technology.

Note:

The preferred method of representing HDFS files is by using the HDFS technology.

Reverse-Engineering HDFS Files into HDFS Data Stores

To reverse-engineer an HDFS file, perform the following steps:

  1. Create an HDFS data store.

  2. From the Storage Tab, select the Storage Format from the Storage Format drop-down list and specify the complete path of the schema file in the Schema File field.

    The schema file should be located in the local file system.

  3. Click Reverse Engineer on the Attributes tab of the HDFS data store.

Note:

  • There is no need to import an RKM into the project.

  • HDFS reverse-engineering requires a Schema (JSON, Parquet, or Avro), hence HDFS files with a Delimited format cannot be reverse-engineered.

For more information, see the Reverse-engineer a File Model section in Connectivity and Knowledge Modules Guide for Oracle Data Integrator Developer's Guide.

Reverse-Engineering HDFS Files in File Data Stores

HDFS files can be reverse-engineered like regular files. To reverse-engineer HDFS files, you must copy them to your local file system and follow the same process as for regular files.

Note:

If the file is too large for your local file system, retrieve the first N records from HDFS and place them in a local file.
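
For example, a command along the following lines can be used to place a sample of an HDFS file in the local file system; the path and record count shown here are hypothetical:

    hdfs dfs -cat /user/oracle/data/customers.csv | head -n 1000 > /tmp/customers_sample.csv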

Reverse-Engineering Cassandra Tables

RKM Cassandra is used to reverse-engineer Cassandra tables. To perform a customized reverse-engineering of Cassandra tables with RKM Cassandra, follow the usual procedures, as described in Developing Integration Projects with Oracle Data Integrator.

The reverse-engineering process creates the data stores for the corresponding Cassandra table. For more information about RKM Cassandra, see RKM Cassandra.

Reverse-Engineering Support for Kafka

Reverse-engineering for Kafka is very similar to reverse-engineering HDFS files.

Create a model based on Kafka technology. Create a data store in that model as mentioned below:

  1. Go to the Definition panel and enter Name and Resource Name.
  2. Go to the Storage panel, select the Storage Format and specify the path of the Schema File.
    The Schema File has to be locally accessible.

    Note:

    The Storage Format can be AVRO, JSON, or PARQUET. The DELIMITED Storage Format is not supported for reverse-engineering. Use the Data Store Editor to create a Kafka data store with the DELIMITED Storage Format.
  3. Go to the Attribute panel and click Reverse Engineer.
    All the attributes specified in the Schema File are listed here.

Password Handling in Hadoop

Before using LKM SQL to Spark, LKM Spark to SQL, and LKM Spark to Cassandra, the Hadoop Credential Provider has to be configured and the password has to be defined.

To use these KMs, you must follow this procedure:

  1. Configure the Hadoop Credential Provider.

    JDBC connection passwords are stored using the Hadoop Credential API. This requires the Hadoop cluster to be configured with at least one Credential Provider.

    Below is an example:

    <property>
      <name>hadoop.security.credential.provider.path</name>
      <value>user:///,jceks://file/tmp/test.jceks,jceks://hdfs@cluster1-ns/my/path/test.jceks</value>
    </property>

    Note:

    The property in the example above should be defined in core-site.xml or its equivalent.
    For the proper configuration applicable to your system and security needs, see the CredentialProvider API Guide.
  2. Create a password alias in Hadoop Credential Provider.

    Once the Hadoop cluster is configured, you must create a credential for each password that Spark will be using to connect to the SQL source or target. ODI will assume the following format for credential alias names:

    odi.<user_name>.<dataserver_name>.password

    The user_name and dataserver_name are obtained from the ODI topology DataServer properties.

    The example below shows the creation of a password alias in the Hadoop Credential Provider, where the user name is oracle and the data server is Hadoop_CDH5_5.

    hadoop credential create odi.oracle.Hadoop_CDH5_5.password
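
    If the credential provider path is not configured globally in core-site.xml, the provider can also be passed explicitly on the command line. The sketch below reuses the provider path from the earlier example and then lists the aliases to verify that the credential was created; adapt the path to your environment:

    hadoop credential create odi.oracle.Hadoop_CDH5_5.password -provider jceks://file/tmp/test.jceks
    hadoop credential list -provider jceks://file/tmp/test.jceks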

Loading Data from Files into Hive

To load data from files into Hive, create data stores for the local and HDFS files, create a mapping, and then select the LKM File to Hive LOAD DATA knowledge module to load data from flat files into Hive.

The LKM File to Hive KMs support loading data from HDFS files as well as local files. However, if you are using HDFS files, the preferred way is to use the HDFS KMs, as described in Loading Data from HDFS File to Hive.

  1. Create the data stores for local files and HDFS files.

    For information on reverse-engineering and configuring local file data sources, see Connectivity and Knowledge Modules Guide for Oracle Data Integrator Developer's Guide.

  2. Create a mapping using the file data store as the source and the corresponding Hive table as the target.
  3. Use the LKM File to Hive LOAD DATA or the LKM File to Hive LOAD DATA Direct knowledge module specified in the physical diagram of the mapping.

    These integration knowledge modules load data from flat files into Hive, replacing or appending any existing data.

For more information about these KMs, see LKM File to Hive LOAD DATA and LKM File to Hive LOAD DATA Direct.

Loading Data from Hive to Files

To load data from Hive tables to a local file system or HDFS files, create a data store for the Hive tables, create a mapping, and then select the LKM Hive to File Direct knowledge module to load data from Hive into flat files.

To load data from Hive tables to a local file system or an HDFS file:

  1. Create a data store for the Hive tables that you want to load in flat files.

    For information about reverse-engineering and configuring Hive data sources, see Setting Up Hive Data Sources.

  2. Create a mapping using the Hive data store as the source and the corresponding File data source as the target.
  3. Use the LKM Hive to File Direct knowledge module, specified in the physical diagram of the mapping.

    This integration knowledge module loads data from Hive into flat Files.

    For more information about LKM Hive to File Direct, see LKM Hive to File Direct.

Loading Data from HBase into Hive

To load data from an HBase table into Hive, create a data store for the HBase table, create a mapping, and then select the LKM HBase to Hive HBASE-SERDE knowledge module to load data from the HBase table into Hive.

To load data from an HBase table into Hive:

  1. Create a data store for the HBase table that you want to load in Hive.

    For information about reverse-engineering and configuring HBase data sources, see Setting Up HBase Data Sources.

  2. Create a mapping using the HBase data store as the source and the corresponding Hive table as the target.
  3. Use the LKM HBase to Hive HBASE-SERDE knowledge module, specified in the physical diagram of the mapping.

    This knowledge module provides read access to an HBase table from Hive.

For more information about LKM HBase to Hive HBASE-SERDE, see LKM HBase to Hive HBASE-SERDE.

Loading Data from Hive into HBase

To load data from Hive into an HBase table, create a data store for the Hive tables, create a mapping, and then select the LKM Hive to HBase Incremental Update HBASE-SERDE Direct knowledge module to load data from the Hive table into HBase.

To load data from a Hive table into HBase:

  1. Create a data store for the Hive tables that you want to load in HBase.

    For information about reverse-engineering and configuring Hive data sources, see Setting Up Hive Data Sources.

  2. Create a mapping using the Hive data store as the source and the corresponding HBase table as the target.
  3. Use the LKM Hive to HBase Incremental Update HBASE-SERDE Direct knowledge module, specified in the physical diagram of the mapping.

    This integration knowledge module loads data from Hive into HBase and supports inserting new rows as well as updating existing data.

For more information about LKM Hive to HBase Incremental Update HBASE-SERDE Direct, see LKM Hive to HBase Incremental Update HBASE-SERDE Direct.

Loading Data from an SQL Database into Hive, HBase, and File using SQOOP

To load data from an SQL Database into Hive, HBase, and File using SQOOP, create a data store for the SQL source, create a mapping, and then select the IKM SQL to Hive-HBase-File (SQOOP) knowledge module to load data from the SQL source into a Hive, HBase, or File target using SQOOP.

To load data from an SQL Database into a Hive, HBase, and File target:

  1. Create a data store for the SQL source that you want to load into Hive, HBase, or File target.

    For information about reverse-engineering and configuring SQL data sources, see Connectivity and Knowledge Modules Guide for Oracle Data Integrator Developer's Guide.

  2. Create a mapping using the SQL source data store as the source and the corresponding HBase table, Hive table, or HDFS files as the target.
  3. Use the IKM SQL to Hive-HBase-File (SQOOP) knowledge module, specified in the physical diagram of the mapping.

    Note:

    The IKM SQL to Hive-HBase-File (SQOOP) is not seeded and has to be manually imported.

    This integration knowledge module loads data from a SQL source into a Hive, HBase, or File target. It uses SQOOP to load the data into Hive, HBase, and File targets. SQOOP uses parallel JDBC connections to load the data.

For more information about IKM SQL to Hive-HBase-File (SQOOP), see IKM SQL to Hive-HBase-File (SQOOP) [Deprecated].

Loading Data from an SQL Database into Hive using SQOOP

To load data from an SQL Database into Hive using SQOOP, create a data store for the SQL source, create a mapping, and then select the LKM SQL to Hive SQOOP knowledge module to load data from the SQL source into Hive using SQOOP.

To load data from an SQL Database into a Hive target:

  1. Create a data store for the SQL source that you want to load into Hive target.

    For information about reverse-engineering and configuring SQL data sources, see Connectivity and Knowledge Modules Guide for Oracle Data Integrator Developer's Guide.

  2. Create a mapping using the SQL source data store as the source and the corresponding Hive table as the target.
  3. Use the LKM SQL to Hive SQOOP knowledge module, specified in the physical diagram of the mapping.

    This KM loads data from a SQL source into Hive. It uses SQOOP to load the data into Hive. SQOOP uses parallel JDBC connections to load the data.

For more information about LKM SQL to Hive SQOOP, see LKM SQL to Hive SQOOP.

Loading Data from an SQL Database into HDFS File using SQOOP

To load data from an SQL Database into an HDFS File target:

  1. Create a data store for the SQL source that you want to load into HDFS File target.

    For information about reverse-engineering and configuring SQL data sources, see Connectivity and Knowledge Modules Guide for Oracle Data Integrator Developer's Guide.

  2. Create a mapping using the SQL source data store as the source and the corresponding HDFS files as the target.
  3. Use the LKM SQL to File SQOOP Direct knowledge module, specified in the physical diagram of the mapping.

    This integration knowledge module loads data from a SQL source into an HDFS File target. It uses SQOOP to load the data into File targets. SQOOP uses parallel JDBC connections to load the data.

For more information about LKM SQL to File SQOOP Direct, see LKM SQL to File SQOOP Direct.

Loading Data from an SQL Database into HBase using SQOOP

To load data from an SQL Database into an HBase target:

  1. Create a data store for the SQL source that you want to load into HBase target.

    For information about reverse-engineering and configuring SQL data sources, see Connectivity and Knowledge Modules Guide for Oracle Data Integrator Developer's Guide.

  2. Create a mapping using the SQL source data store as the source and the corresponding HBase table as the target.
  3. Use the LKM SQL to HBase SQOOP Direct knowledge module, specified in the physical diagram of the mapping.

    This integration knowledge module loads data from a SQL source into an HBase target. It uses SQOOP to load the data into HBase targets. SQOOP uses parallel JDBC connections to load the data.

For more information about LKM SQL to HBase SQOOP Direct, see LKM SQL to HBase SQOOP Direct.

Validating and Transforming Data Within Hive

After loading data into Hive, you can validate and transform the data using knowledge modules such as IKM Hive Control Append, CKM Hive, and IKM Hive Transform.

Note:

IKM Hive Control Append, CKM Hive, and IKM Hive Transform have to be imported.

Loading Data into an Oracle Database from Hive and File

Use the knowledge modules listed in the following table to load data from an HDFS file or Hive source into an Oracle database target using Oracle Loader for Hadoop.

Table 4-4 Knowledge Modules to load data into Oracle Database

Knowledge Module Use To...

IKM File-Hive to Oracle (OLH-OSCH)

Load data from an HDFS file or Hive source into an Oracle database target using Oracle Loader for Hadoop.

For more information, see IKM File-Hive to Oracle (OLH-OSCH) [Deprecated].

Note:

This KM has to be imported.

LKM File to Oracle OLH-OSCH

Load data from an HDFS file into an Oracle staging table using Oracle Loader for Hadoop.

For more information, see LKM File to Oracle OLH-OSCH.

LKM File to Oracle OLH-OSCH Direct

Load data from an HDFS file into an Oracle database target using Oracle Loader for Hadoop.

For more information, see LKM File to Oracle OLH-OSCH Direct.

LKM Hive to Oracle OLH-OSCH

Load data from a Hive source into an Oracle staging table using Oracle Loader for Hadoop.

For more information, see LKM Hive to Oracle OLH-OSCH.

LKM Hive to Oracle OLH-OSCH Direct

Load data from a Hive source into an Oracle database target using Oracle Loader for Hadoop.

For more information, see LKM Hive to Oracle OLH-OSCH Direct.

Loading Data into an SQL Database from HBase, Hive, and File using SQOOP

Use the knowledge modules listed in the following table to load data from an HDFS file, HBase source, or Hive source into an SQL database target using SQOOP.

Table 4-5 Knowledge Modules to load data into SQL Database

Knowledge Module Use To...

IKM File-Hive to SQL (SQOOP)

Load data from an HDFS file or Hive source into an SQL database target using SQOOP.

For more information, see IKM File-Hive to SQL (SQOOP) [Deprecated].

Note:

This KM has to be imported.

LKM HBase to SQL SQOOP

Load data from an HBase source into an SQL database target using SQOOP.

For more information, see LKM HBase to SQL SQOOP.

LKM File to SQL SQOOP

Load data from an HDFS file into an SQL database target using SQOOP.

For more information, see LKM File to SQL SQOOP.

LKM Hive to SQL SQOOP

Load data from a Hive source into an SQL database target using SQOOP.

For more information, see LKM Hive to SQL SQOOP.

Loading Data from Kafka to Spark Processing Engine

To load data from Kafka to the Spark processing engine:

  1. Create a data store for the Kafka tables that you want to load in Spark.

    For configuring Kafka data sources, see Setting Up Kafka Data Sources.

  2. Create a mapping using the Kafka data store as the source and the File/HDFS/SQL/Hive/Kafka data store as the target. Use Spark Python Physical Schema as the staging location.

    For more information, see Creating a Spark Physical Schema.

  3. Use the Storage function KM option with the value createStream for a receiver-based connection, or the value createDirectStream for a direct connection, as specified in the physical diagram of the mapping.

    Set the zookeeper.connect and metadata.broker.list Kafka data server properties for the appropriate connection (see the example at the end of this section).

    This knowledge module loads data from Kafka into the Spark processing engine. You can use other knowledge modules to load data from Spark into File/HDFS/SQL/Hive/Kafka.

    Note:

    Every Kafka source in an ODI mapping allocates a Spark executor. A Spark Kafka mapping hangs if the number of available executors is low. The number of executors must be at least n+1, where n is the number of Kafka sources in the mapping. For additional information, refer to the Spark documentation.
For more information about LKM Kafka to Spark, see LKM Kafka to Spark.
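
As a reference for the zookeeper.connect and metadata.broker.list data server properties mentioned above, the values typically take the following form; the host names and ports below are examples only and must be replaced with those of your Kafka cluster:

    zookeeper.connect=zkhost1.example.com:2181,zkhost2.example.com:2181,zkhost3.example.com:2181
    metadata.broker.list=kafkahost1.example.com:9092,kafkahost2.example.com:9092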