This chapter provides information about the steps you need to perform to integrate Hadoop data.
This chapter includes the following sections:
Section 4.11, "Loading Data from an SQL Database into Hive, HBase, and File using SQOOP"
Section 4.12, "Loading Data from an SQL Database into Hive using SQOOP"
Section 4.13, "Loading Data from an SQL Database into File using SQOOP"
Section 4.14, "Loading Data from an SQL Database into HBase using SQOOP"
Section 4.15, "Validating and Transforming Data Within Hive"
Section 4.16, "Loading Data into an Oracle Database from Hive and File"
Section 4.17, "Loading Data into an SQL Database from HBase, Hive, and File using SQOOP"
The following table summarizes the steps for integrating Hadoop data.
Table 4-1 Integrating Hadoop Data
Step | Description
---|---
Set Up Data Sources | Set up the data sources to create the data source models. You must set up File, Hive, and HBase data sources. See Section 4.2, "Setting Up File Data Sources".
Import Hadoop Knowledge Modules | Import the Hadoop KMs into Global Objects or a project.
Create Oracle Data Integrator Models | Reverse-engineer the Hive and HBase models to create Oracle Data Integrator models.
Integrate Hadoop Data | Design mappings to load, validate, and transform Hadoop data. See Section 4.7, "Loading Data from Files into Hive"; Section 4.9, "Loading Data from HBase into Hive"; Section 4.10, "Loading Data from Hive into HBase"; Section 4.11, "Loading Data from an SQL Database into Hive, HBase, and File using SQOOP"; Section 4.15, "Validating and Transforming Data Within Hive"; Section 4.16, "Loading Data into an Oracle Database from Hive and File"; Section 4.17, "Loading Data into an SQL Database from HBase, Hive, and File using SQOOP".
In the Hadoop context, there is a distinction between files in Hadoop Distributed File System (HDFS) and local files (outside of HDFS).
To define a data source:
Create a Data Server object under File technology.
Create a Physical Schema object for every directory to be accessed.
Create a Logical Schema object for every directory to be accessed.
Create a Model for every Logical Schema.
Create one or more data stores for each different type of file and wildcard name pattern.
For HDFS files, create a Data Server object under File technology, entering the HDFS name node in the JDBC URL field and leaving the JDBC Driver name empty. For example:
hdfs://bda1node01.example.com:8020
Test Connection is not supported for this Data Server configuration.
Note:
No dedicated technology is defined for HDFS files.

The following steps in Oracle Data Integrator are required for connecting to a Hive system. Oracle Data Integrator connects to Hive by using JDBC.

The Hive technology must be included in the standard Oracle Data Integrator technologies. If it is not, then import the technology in INSERT_UPDATE mode from the xml-reference directory.
You must add all Hive-specific flex fields.
To set up a Hive data source:
Create a Data Server object under Hive technology.
Set the following locations under JDBC:
JDBC Driver: weblogic.jdbc.hive.HiveDriver
JDBC URL: jdbc:weblogic:hive://<host>:<port>[; property=value[;...]]
For example, jdbc:weblogic:hive://localhost:10000;DatabaseName=default;User=default;Password=default
Note:
Usually, the user ID and password are provided in the respective fields of an ODI Data Server. If a Hive user is defined without a password, you must add password=default as part of the JDBC URL and leave the password field of the Data Server blank.

Set the following under Flexfields:
Hive Metastore URIs: for example, thrift://BDA:10000
Ensure that the Hive server is up and running.
Test the connection to the Data Server.
Create a Physical Schema. Enter the name of the Hive schema in both schema fields of the Physical Schema definition.
Create a Logical Schema object.
Import RKM Hive into Global Objects or a project.
Create a new model for Hive Technology pointing to the logical schema.
Perform a custom reverse-engineering operation using RKM Hive.
At the end of this process, the Hive Data Model contains all Hive tables with their columns, partitioning, and clustering details stored as flex field values.
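The partitioning, clustering, and sorting details that RKM Hive captures correspond directly to the source table's DDL. As an illustrative sketch (the table and column names here are hypothetical, not from the product):

```sql
-- Hypothetical Hive table: RKM Hive would record log_day as a partition
-- column, session_id as a cluster column, hits as a sort column, and 32
-- as the number of buckets in the corresponding flex field values.
CREATE TABLE web_logs (
  session_id STRING,
  url        STRING,
  hits       INT
)
PARTITIONED BY (log_day STRING)
CLUSTERED BY (session_id) SORTED BY (hits ASC) INTO 32 BUCKETS;
```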
The following steps in Oracle Data Integrator are required for connecting to a HBase system.
The HBase technology must be included in the standard Oracle Data Integrator technologies. If it is not, then import the technology in INSERT_UPDATE mode from the xml-reference directory.
You must add all HBase-specific flex fields.
To set up an HBase data source:
Create a Data Server object under HBase technology.
JDBC Driver and URL are not available for data servers of this technology.
Set the following under Flexfields:
HBase Quorum: Quorum of the HBase installation. For example: zkhost1.mydomain.com,zkhost2.mydomain.com,zkhost3.mydomain.com
Ensure that the HBase server is up and running.
Note:
You cannot test the connection to the HBase Data Server.

Create a Physical Schema.
Create a Logical Schema object.
Import RKM HBase into Global Objects or a project.
Create a new model for HBase Technology pointing to the logical schema.
Perform a custom reverse-engineering operation using RKM HBase.
Note:
Ensure that the HBase tables contain some data before performing reverse-engineering. The reverse-engineering operation does not work if the HBase tables are empty.

At the end of this process, the HBase Data Model contains all the HBase tables with their columns and data types.
You need to import the KMs that you want to use in your integration projects.
Import the following KMs into Global Objects or a project:
IKM Hive Control Append
CKM Hive
RKM Hive
RKM HBase
IKM Hive Transform
IKM File-Hive to Oracle (OLH-OSCH)
IKM File-Hive to SQL (SQOOP)
IKM SQL to Hive-HBase-File (SQOOP)
You must create ODI models from the reverse-engineered Hive, HBase, and HDFS models. The reverse-engineering process creates data stores for the corresponding Hive and HBase tables. You can use these data stores as sources or targets in your mappings.
This section contains the following topics:
To create a model that is based on the technology hosting Hive, HBase, or HDFS and on the logical schema created when you configured the Hive, HBase, or HDFS connection, follow the standard procedure described in Developing Integration Projects with Oracle Data Integrator.
RKM Hive is used to reverse engineer Hive tables and views. To perform a customized reverse-engineering of Hive tables with RKM Hive, follow the usual procedures, as described in Developing Integration Projects with Oracle Data Integrator. This topic details information specific to Hive tables.
The reverse-engineering process creates the data stores for the corresponding Hive table or views. You can use the data stores as either a source or a target in a mapping.
For more information about RKM Hive, see Section A.18, "RKM Hive".
Table 4-2 describes the created flex fields.
Table 4-2 Flex Fields for Reverse-Engineered Hive Tables and Views
Object | Flex Field Name | Flex Field Code | Flex Field Type | Description
---|---|---|---|---
DataStore | | | String | Number of buckets to be used for clustering
Column | Hive Partition Column | | Numeric | All partitioning columns are marked as "1". Partition information can come from the following:
Column | Hive Cluster Column | | Numeric | All cluster columns are marked as "1".
Column | Hive Sort Column | | Numeric | All sort columns are marked as "1".
RKM HBase is used to reverse engineer HBase tables. To perform a customized reverse-engineering of HBase tables with RKM HBase, follow the usual procedures, as described in Developing Integration Projects with Oracle Data Integrator. This topic details information specific to HBase tables.
The reverse-engineering process creates the data stores for the corresponding HBase table. You can use the data stores as either a source or a target in a mapping.
Note:
Ensure that the HBase tables contain some data before performing reverse-engineering. The reverse-engineering operation does not work if the HBase tables are empty.

For more information about RKM HBase, see Section A.19, "RKM HBase".
Table 4-3 describes the created flex fields.
Table 4-3 Flex Fields for Reverse-Engineered HBase Tables
Object | Flex Field Name | Flex Field Code | Flex Field Type | Description
---|---|---|---|---
DataStore | | | String | Comma-separated list of Zookeeper nodes. It is used by the HBase client to locate the HBase Master server and the HBase Region servers.
Column | HBase storage type | | String | Defines how a data type is physically stored in HBase. Permitted values are
HDFS files are reverse-engineered in the same way as regular files.
Refer to Reverse-engineer a File Model in Connectivity and Knowledge Modules Guide for Oracle Data Integrator for more information.
To load data from the local file system or the HDFS file system into Hive tables:
Create the data stores for local files and HDFS files.
Refer to Connectivity and Knowledge Modules Guide for Oracle Data Integrator for information about reverse engineering and configuring local file data sources.
Create a mapping using the file data store as the source and the corresponding Hive table as the target.
Use the LKM File to Hive LOAD DATA or the LKM File to Hive LOAD DATA Direct knowledge module specified in the physical diagram of the mapping.
These knowledge modules load data from flat files into Hive, replacing or appending to any existing data.
For more information about the KMs, see the following sections:
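Conceptually, these KMs issue HiveQL LOAD DATA statements. A hedged sketch of the two cases (paths and table name are hypothetical):

```sql
-- Local file (outside HDFS): the LOCAL keyword copies the file into Hive.
LOAD DATA LOCAL INPATH '/tmp/web_logs.txt'
  OVERWRITE INTO TABLE web_logs;      -- OVERWRITE replaces existing data

-- HDFS file: without LOCAL, the file is moved within HDFS.
LOAD DATA INPATH '/user/odi/in/web_logs.txt'
  INTO TABLE web_logs;                -- without OVERWRITE, data is appended
```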
To load data from Hive tables to a local file system or an HDFS file:
Create a data store for the Hive tables that you want to load in flat files.
Refer to "Setting Up Hive Data Sources" for information about reverse engineering and configuring Hive data sources.
Create a mapping using the Hive data store as the source and the corresponding File data source as the target.
Use the LKM Hive to File Direct knowledge module, specified in the physical diagram of the mapping.
This knowledge module loads data from Hive into flat files.
For more information about LKM Hive to File Direct, see Section A.12, "LKM Hive to File Direct".
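Under the hood this corresponds to a HiveQL directory export. An illustrative sketch (the directory and table name are hypothetical):

```sql
-- Write query results to an HDFS directory.
INSERT OVERWRITE DIRECTORY '/user/odi/export/web_logs'
SELECT session_id, url, hits FROM web_logs;

-- Add the LOCAL keyword to write to the local file system instead:
-- INSERT OVERWRITE LOCAL DIRECTORY '/tmp/web_logs' SELECT ...;
```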
To load data from an HBase table into Hive:
Create a data store for the HBase table that you want to load in Hive.
Refer to "Setting Up HBase Data Sources" for information about reverse engineering and configuring HBase data sources.
Create a mapping using the HBase data store as the source and the corresponding Hive table as the target.
Use the LKM HBase to Hive HBASE-SERDE knowledge module, specified in the physical diagram of the mapping.
This knowledge module provides read access to an HBase table from Hive.
For more information about LKM HBase to Hive HBASE-SERDE, see Section A.10, "LKM HBase to Hive HBASE-SERDE".
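The read access relies on the Hive HBase storage handler. A hedged sketch of the kind of external table definition involved (table, column, and column-family names are hypothetical):

```sql
-- Hive external table over an existing HBase table 'web_logs'.
-- ':key' maps to the HBase row key; 'cf:url' and 'cf:hits' map to
-- columns in the column family 'cf'.
CREATE EXTERNAL TABLE web_logs_hbase (
  session_id STRING,
  url        STRING,
  hits       INT
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf:url,cf:hits')
TBLPROPERTIES ('hbase.table.name' = 'web_logs');
```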
To load data from a Hive table into HBase:
Create a data store for the Hive tables that you want to load in HBase.
Refer to "Setting Up Hive Data Sources" for information about reverse engineering and configuring Hive data sources.
Create a mapping using the Hive data store as the source and the corresponding HBase table as the target.
Use the LKM Hive to HBase Incremental Update HBASE-SERDE Direct knowledge module, specified in the physical diagram of the mapping.
This knowledge module loads data from Hive into HBase. It supports inserting new rows as well as updating existing data.
For more information about LKM Hive to HBase Incremental Update HBASE-SERDE Direct, see Section A.11, "LKM Hive to HBase Incremental Update HBASE-SERDE Direct".
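Because HBase upserts rows by row key, writing through an HBase-backed Hive table naturally yields insert-or-update behavior. An illustrative sketch (table names are hypothetical):

```sql
-- Rows with a new row key are inserted; rows whose key already exists
-- in the HBase table are updated.
INSERT INTO TABLE web_logs_hbase
SELECT session_id, url, hits FROM web_logs;
```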
To load data from an SQL Database into a Hive, HBase, and File target:
Create a data store for the SQL source that you want to load into Hive, HBase, or File target.
Refer to Connectivity and Knowledge Modules Guide for Oracle Data Integrator for information about reverse engineering and configuring SQL data sources.
Create a mapping using the SQL source data store as the source and the corresponding HBase table, Hive table, or HDFS files as the target.
Use the IKM SQL to Hive-HBase-File (SQOOP) knowledge module, specified in the physical diagram of the mapping.
This integration knowledge module loads data from a SQL source into a Hive, HBase, or File target. It uses SQOOP, which loads the data over parallel JDBC connections.
For more information about IKM SQL to Hive-HBase-File (SQOOP), see Section A.23, "IKM SQL to Hive-HBase-File (SQOOP) [Deprecated]".
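The SQOOP load that the KM drives can be pictured as a command of the following shape (the connection details, table, and mapper count are hypothetical; the KM generates and runs the actual command for the chosen target):

```shell
sqoop import \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username odi --password-file /user/odi/.sqoop.pw \
  --table ORDERS \
  --split-by ORDER_ID \
  --num-mappers 4        # four parallel JDBC connections
```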
To load data from an SQL Database into a Hive target:
Create a data store for the SQL source that you want to load into the Hive target.
Refer to Connectivity and Knowledge Modules Guide for Oracle Data Integrator for information about reverse engineering and configuring SQL data sources.
Create a mapping using the SQL source data store as the source and the corresponding Hive table as the target.
Use the LKM SQL to Hive SQOOP knowledge module, specified in the physical diagram of the mapping.
This KM loads data from a SQL source into Hive. It uses SQOOP, which loads the data over parallel JDBC connections.
For more information about LKM SQL to Hive SQOOP, see Section A.1, "LKM SQL to Hive SQOOP".
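For a Hive target, the generated SQOOP command adds Hive import options; a hedged sketch (connection details and names are hypothetical):

```shell
sqoop import \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username odi --password-file /user/odi/.sqoop.pw \
  --table ORDERS \
  --hive-import --hive-table default.orders \
  --num-mappers 4
```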
To load data from an SQL Database into a File target:
Create a data store for the SQL source that you want to load into the File target.
Refer to Connectivity and Knowledge Modules Guide for Oracle Data Integrator for information about reverse engineering and configuring SQL data sources.
Create a mapping using the SQL source data store as the source and the corresponding HDFS files as the target.
Use the LKM SQL to File SQOOP Direct knowledge module, specified in the physical diagram of the mapping.
This knowledge module loads data from a SQL source into an HDFS file target. It uses SQOOP, which loads the data over parallel JDBC connections.
For more information about LKM SQL to File SQOOP Direct, see Section A.2, "LKM SQL to File SQOOP Direct".
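For a file target, the generated SQOOP command writes directly to an HDFS directory; a hedged sketch (connection details and paths are hypothetical):

```shell
sqoop import \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username odi --password-file /user/odi/.sqoop.pw \
  --table ORDERS \
  --target-dir /user/odi/in/orders \
  --num-mappers 4
```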
To load data from an SQL Database into an HBase target:
Create a data store for the SQL source that you want to load into the HBase target.
Refer to Connectivity and Knowledge Modules Guide for Oracle Data Integrator for information about reverse engineering and configuring SQL data sources.
Create a mapping using the SQL source data store as the source and the corresponding HBase table as the target.
Use the LKM SQL to HBase SQOOP Direct knowledge module, specified in the physical diagram of the mapping.
This knowledge module loads data from a SQL source into an HBase target. It uses SQOOP, which loads the data over parallel JDBC connections.
For more information about LKM SQL to HBase SQOOP Direct, see Section A.3, "LKM SQL to HBase SQOOP Direct".
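For an HBase target, the generated SQOOP command maps the source rows to an HBase table, column family, and row key; a hedged sketch (all names are hypothetical):

```shell
sqoop import \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username odi --password-file /user/odi/.sqoop.pw \
  --table ORDERS \
  --hbase-table orders --column-family cf --hbase-row-key ORDER_ID \
  --num-mappers 4
```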
After loading data into Hive, you can validate and transform the data using the following knowledge modules.
IKM Hive Control Append
For more information, see Section A.7, "IKM Hive Append".
IKM Hive Append
For more information, see Section A.7, "IKM Hive Append".
CKM Hive
For more information, see Section A.25, "CKM Hive (Deprecated)".
IKM Hive Transform
For more information, see Section A.26, "IKM Hive Transform (Deprecated)".
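These KMs generate HiveQL of roughly this shape; an illustrative sketch of a combined validate-and-transform step (table and column names are hypothetical):

```sql
-- Append only rows that pass the checks, transforming them as they load.
INSERT INTO TABLE web_logs_clean
SELECT session_id,
       lower(url) AS url,      -- normalize during the load
       hits
FROM   web_logs
WHERE  hits IS NOT NULL        -- reject rows that fail validation
  AND  hits >= 0;
```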
Use the knowledge modules listed in the following table to load data from an HDFS file or Hive source into an Oracle database target using Oracle Loader for Hadoop.
Table 4-4 Knowledge Modules to load data into Oracle Database
Knowledge Module | Use To...
---|---
IKM File-Hive to Oracle (OLH-OSCH) | Load data from an HDFS file or Hive source into an Oracle database target using Oracle Loader for Hadoop. For more information, see Section A.27, "IKM File-Hive to Oracle (OLH-OSCH) [Deprecated]".
LKM File to Oracle OLH-OSCH | Load data from an HDFS file into an Oracle staging table using Oracle Loader for Hadoop. For more information, see Section A.14, "LKM File to Oracle OLH-OSCH".
LKM File to Oracle OLH-OSCH Direct | Load data from an HDFS file into an Oracle database target using Oracle Loader for Hadoop. For more information, see Section A.15, "LKM File to Oracle OLH-OSCH Direct".
LKM Hive to Oracle OLH-OSCH | Load data from a Hive source into an Oracle staging table using Oracle Loader for Hadoop. For more information, see Section A.16, "LKM Hive to Oracle OLH-OSCH".
LKM Hive to Oracle OLH-OSCH Direct | Load data from a Hive source into an Oracle database target using Oracle Loader for Hadoop. For more information, see Section A.17, "LKM Hive to Oracle OLH-OSCH Direct".
Use the knowledge modules listed in the following table to load data from an HDFS file, HBase source, or Hive source into an SQL database target using SQOOP.
Table 4-5 Knowledge Modules to load data into SQL Database
Knowledge Module | Use To...
---|---
IKM File-Hive to SQL (SQOOP) | Load data from an HDFS file or Hive source into an SQL database target using SQOOP. For more information, see Section A.28, "IKM File-Hive to SQL (SQOOP) [Deprecated]".
LKM HBase to SQL SQOOP | Load data from an HBase source into an SQL database target using SQOOP. For more information, see Section A.6, "LKM HBase to SQL SQOOP".
LKM File to SQL SQOOP | Load data from an HDFS file into an SQL database target using SQOOP. For more information, see Section A.4, "LKM File to SQL SQOOP".
LKM Hive to SQL SQOOP | Load data from a Hive source into an SQL database target using SQOOP. For more information, see Section A.5, "LKM Hive to SQL SQOOP".
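The SQOOP export that these KMs drive can be pictured as follows (connection details and paths are hypothetical; '\001' is Hive's default field delimiter):

```shell
sqoop export \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username odi --password-file /user/odi/.sqoop.pw \
  --table ORDERS_ARCHIVE \
  --export-dir /user/hive/warehouse/orders \
  --input-fields-terminated-by '\001' \
  --num-mappers 4
```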