This chapter includes the following sections:
The following table summarizes the steps for integrating Hadoop data.
Table 4-1 Integrating Hadoop Data
In the Hadoop context, there is a distinction between files in Hadoop Distributed File System (HDFS) and local files (outside of HDFS).
To define a data source:
Create a Data Server object under File technology.
Create a Physical Schema object for every directory to be accessed.
Create a Logical Schema object for every directory to be accessed.
Create a Model for every Logical Schema.
Create one or more data stores for each different type of file and wildcard name pattern.
For HDFS files, create a Data Server object under File technology, entering the HDFS name node in the JDBC URL field and leaving the JDBC Driver name empty. For example:
hdfs://bda1node01.example.com:8020
Test Connection is not supported for this Data Server configuration. A standalone sketch for verifying the name node outside ODI follows the note below.
Note:
No dedicated technology is defined for HDFS files.
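Because Test Connection is not supported here, you can verify the name node outside ODI, as referenced above. This is a minimal sketch, assuming the Hadoop client libraries (hadoop-client) are on the classpath and using the example name-node URL shown earlier:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsConnectionCheck {
    public static void main(String[] args) throws Exception {
        // Same URL that goes into the ODI Data Server's JDBC URL field.
        URI nameNode = URI.create("hdfs://bda1node01.example.com:8020");
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(nameNode, conf)) {
            // Listing the root directory proves the name node is reachable.
            for (FileStatus status : fs.listStatus(new Path("/"))) {
                System.out.println(status.getPath());
            }
        }
    }
}
```

If the listing succeeds, the URL entered in the JDBC URL field points at a reachable name node.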
This topic provides the steps in Oracle Data Integrator that are required for connecting to an HDFS system.
The following steps in Oracle Data Integrator are required for connecting to a Hive system. Oracle Data Integrator connects to Hive by using JDBC.
Prerequisites
The Hive technology must be included in the standard Oracle Data Integrator technologies. If it is not, then import the technology in INSERT_UPDATE mode from the xml-reference directory.
To set up a Hive data source:
Create a Data Server object under Hive technology.
Set the following locations under JDBC:
JDBC Driver: weblogic.jdbc.hive.HiveDriver
JDBC URL: jdbc:weblogic:hive://<host>:<port>[;property=value[;...]]
For example, jdbc:weblogic:hive://localhost:10000;DatabaseName=default;User=default;Password=default
Note:
Usually, the user ID and password are provided in the respective fields of an ODI Data Server. If a Hive user is defined without a password, you must add password=default to the JDBC URL, and the password field of the Data Server must be left blank.
Set the following on the Definition tab of the Data Server:
Hive Metastore URIs: for example, thrift://BDA:10000
Ensure that the Hive server is up and running.
Test the connection to the Data Server. (A standalone JDBC sketch for testing the same URL outside ODI appears after this procedure.)
Create a Physical Schema. Enter the name of the Hive schema in both schema fields of the Physical Schema definition.
Create a Logical Schema object.
Import RKM Hive into Global Objects or a project.
Create a new model for Hive Technology pointing to the logical schema.
Perform a custom reverse-engineering operation using RKM Hive.
The reverse-engineered Hive tables populate the Attributes and Storage tabs of the data store.
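As noted in the procedure above, you can test the same connection outside ODI. This is a minimal sketch, assuming the DataDirect Hive driver JAR shipped with ODI is on the classpath and using the example URL shown earlier; the SHOW TABLES query simply proves the session works:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcCheck {
    public static void main(String[] args) throws Exception {
        // Same driver class and URL as in the ODI Data Server definition.
        Class.forName("weblogic.jdbc.hive.HiveDriver");
        String url = "jdbc:weblogic:hive://localhost:10000;"
                   + "DatabaseName=default;User=default;Password=default";
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SHOW TABLES")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}
```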
Prerequisites
The HBase technology must be included in the standard Oracle Data Integrator technologies. If it is not, then import the technology in INSERT_UPDATE mode from the xml-reference directory.
To set up an HBase data source:
Create a Data Server object under HBase technology.
JDBC Driver and URL are not available for data servers of this technology.
Set the following on the Definition tab of the Data Server:
HBase Quorum: Quorum of the HBase installation. For example: zkhost1.mydomain.com,zkhost2.mydomain.com,zkhost3.mydomain.com
Ensure that the HBase server is up and running.
Note:
You cannot test the connection to the HBase Data Server. (A standalone connectivity sketch appears after this procedure.)
Create a Physical Schema.
Create a Logical Schema object.
Import RKM HBase into Global Objects or a project.
Create a new model for HBase Technology pointing to the logical schema.
Perform a custom reverse-engineering operation using RKM HBase.
Note:
Ensure that the HBase tables contain some data before performing reverse-engineering. The reverse-engineering operation does not work if the HBase tables are empty.
At the end of this process, the HBase Data Model contains all the HBase tables with their columns and data types.
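Because Test Connection is not available for HBase Data Servers, you can verify the quorum outside ODI, as referenced in the note above. This is a minimal sketch, assuming the hbase-client library is on the classpath and using the example quorum shown earlier:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class HBaseQuorumCheck {
    public static void main(String[] args) throws Exception {
        // Same quorum string entered on the ODI Data Server Definition tab.
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum",
                 "zkhost1.mydomain.com,zkhost2.mydomain.com,zkhost3.mydomain.com");
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            // Listing table names proves the quorum and HBase master are reachable.
            for (TableName table : admin.listTableNames()) {
                System.out.println(table.getNameAsString());
            }
        }
    }
}
```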
This topic provides steps in Oracle Data Integrator that are required for connecting to a Kafka system.
The Kafka technology must be included in the standard Oracle Data Integrator technologies. If it is not, then import the technology in INSERT_UPDATE mode from the xml-reference directory.
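As a quick out-of-band check that the Kafka cluster is reachable before wiring it into ODI, the following minimal sketch lists the broker's topics. It assumes the kafka-clients library is on the classpath; the bootstrap server address is a placeholder:

```java
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class KafkaConnectionCheck {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder broker address; use your Kafka bootstrap servers.
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafkahost:9092");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                  StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                  StringDeserializer.class.getName());
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Listing topics confirms the broker is reachable.
            consumer.listTopics().keySet().forEach(System.out::println);
        }
    }
}
```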
This topic provides steps in Oracle Data Integrator that are required for connecting to a Cassandra system. Oracle Data Integrator connects to Cassandra by using JDBC.
The Cassandra technology must be included in the standard Oracle Data Integrator technologies. If it is not, then import the technology in INSERT_UPDATE mode from the xml-reference directory.
You must add all Cassandra-specific flex fields.
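A standalone JDBC connectivity check can also be run outside ODI. The driver class and URL format below are assumptions based on the DataDirect drivers bundled with ODI; verify them against your installation, and treat the keyspace name as a placeholder:

```java
import java.sql.Connection;
import java.sql.DriverManager;

public class CassandraJdbcCheck {
    public static void main(String[] args) throws Exception {
        // Assumed DataDirect driver class; confirm against your ODI install.
        Class.forName("weblogic.jdbc.cassandra.CassandraDriver");
        // KeyspaceName is a placeholder for an existing keyspace.
        String url = "jdbc:weblogic:cassandra://localhost:9042;KeyspaceName=mykeyspace";
        try (Connection conn = DriverManager.getConnection(url)) {
            System.out.println("Connected to: "
                    + conn.getMetaData().getDatabaseProductName());
        }
    }
}
```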
Most of the Big Data Knowledge Modules are built into the product. The exceptions are the RKMs and CKMs, which must be imported into your project or as global objects before you use them.
CKM Hive
RKM Hive
RKM HBase
RKM Cassandra
You must create an ODI model from the reverse-engineered Hive, HBase, and HDFS models. The reverse-engineering process creates Hive and HBase data stores for the corresponding Hive and HBase tables. You can use these data stores as a source or target in your mappings.
This section contains the following topics:
To create a model that is based on the technology hosting Hive, HBase, or HDFS and on the logical schema created when you configured the Hive, HBase, HDFS, or File connection, follow the standard procedure described in Oracle Fusion Middleware Developing Integration Projects with Oracle Data Integrator.
For backward compatibility, the Big Data LKMs that read from files (for example, LKM File to Hive LOAD DATA) also support reading from HDFS; however, the source data store must be from a File model. When reading from HDFS, it is preferable to use KMs such as LKM HDFS to File LOAD DATA; in this case, the source data store must be from an HDFS model.
RKM Hive is used to reverse engineer Hive tables and views. To perform a customized reverse-engineering of Hive tables with RKM Hive, follow the usual procedures, as described in Oracle Fusion Middleware Developing Integration Projects with Oracle Data Integrator. This topic details information specific to Hive tables.
The reverse-engineering process creates the data stores for the corresponding Hive table or views. You can use the data stores as either a source or a target in a mapping.
For more information about RKM Hive, see RKM Hive.
A Storage tab is added to the Hive data store, providing flexibility in how data is stored and formatted within Hive. If the Hive table already exists, you can run the reverse-engineering process on the Hive model, using the custom RKM Hive, to populate these fields.
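You can also inspect the same storage metadata directly over JDBC, which is useful for checking what the reverse-engineering process will populate. This is a minimal sketch, reusing the Hive Data Server's driver and URL settings; the table name my_hive_table is hypothetical:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveStorageInfo {
    public static void main(String[] args) throws Exception {
        Class.forName("weblogic.jdbc.hive.HiveDriver");
        String url = "jdbc:weblogic:hive://localhost:10000;DatabaseName=default";
        try (Connection conn = DriverManager.getConnection(url, "default", "default");
             Statement stmt = conn.createStatement();
             // DESCRIBE FORMATTED returns the storage metadata (input/output
             // format, SerDe, location) that the RKM captures on the Storage tab.
             ResultSet rs = stmt.executeQuery("DESCRIBE FORMATTED my_hive_table")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getString(2));
            }
        }
    }
}
```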
RKM HBase is used to reverse engineer HBase tables. To perform a customized reverse-engineering of HBase tables with RKM HBase, follow the usual procedures, as described in Oracle Fusion Middleware Developing Integration Projects with Oracle Data Integrator. This topic details information specific to HBase tables.
The reverse-engineering process creates the data stores for the corresponding HBase table. You can use the data stores as either a source or a target in a mapping.
Note:
Ensure that the HBase tables contain some data before performing reverse-engineering. The reverse-engineering operation does not work if the HBase tables are empty.
For more information about RKM HBase, see RKM HBase.
Reverse Engineering HDFS with File Technology
Note:
If the file is too large for your local file system, retrieve the first N records from HDFS and place them in a local file, as shown in the sketch below.
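This is a minimal sketch of that sampling step, assuming the Hadoop client libraries are on the classpath; the HDFS path, local path, and record count N are placeholders. (From a shell, piping hdfs dfs -cat through head achieves the same result.)

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsSampleExtractor {
    public static void main(String[] args) throws Exception {
        int n = 1000; // number of records to sample
        URI nameNode = URI.create("hdfs://bda1node01.example.com:8020");
        try (FileSystem fs = FileSystem.get(nameNode, new Configuration());
             BufferedReader in = new BufferedReader(new InputStreamReader(
                     fs.open(new Path("/user/odi/data/input.csv"))));
             PrintWriter out = new PrintWriter("/tmp/input-sample.csv")) {
            String line;
            // Copy the first N records from HDFS into the local file.
            for (int i = 0; i < n && (line = in.readLine()) != null; i++) {
                out.println(line);
            }
        }
    }
}
```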
Reverse Engineering HDFS with HDFS Technology
To reverse engineer an HDFS file, perform the following steps:
Create an HDFS data store.
On the Storage tab, select a format from the Storage Format field and specify the corresponding schema file in the Schema File field.
Click Reverse Engineer on the Attributes tab of the HDFS data store.
Note:
There is no need to import an RKM into the project.
HDFS files are used in KMs such as File to Hive and File to Spark; these KMs use the ODI File technology as the source. You can also use the HDFS LKMs (for example, LKM HDFS File to Hive); these KMs use the ODI HDFS technology.
Depending on which KMs you want to use, you can select a different technology for the files. Reverse engineering HDFS supports the Avro, JSON, Parquet, and delimited formats.
Refer to Reverse-engineer a File Model in Connectivity and Knowledge Modules Guide for Oracle Data Integrator for more information.
Creating an Oracle Data Integrator Model from Reverse-Engineered Hive, HBase, and HDFS Models
RKM Cassandra is used to reverse engineer Cassandra tables. To perform a customized reverse-engineering of Cassandra tables with RKM Cassandra, follow the usual procedures, as described in Oracle Fusion Middleware Developing Integration Projects with Oracle Data Integrator. This topic details information specific to Cassandra tables.
The reverse-engineering process creates the data stores for the corresponding Cassandra table. For more information about RKM Cassandra, see RKM Cassandra.
The KMs support loading data from HDFS; however, the preferred way is to use the HDFS KMs, as described in Loading Data from HDFS into Hive.
For more information about the KMs, see the following sections:
To load data from Hive tables to a local file system or an HDFS file:
To load data from an HBase table into Hive:
For more information about LKM HBase to Hive HBASE-SERDE, see LKM HBase to Hive HBASE-SERDE.
For more information about LKM Hive to HBase Incremental Update HBASE-SERDE Direct, see LKM Hive to HBase Incremental Update HBASE-SERDE Direct.
To load data from an SQL Database into a Hive, HBase, and File target:
For more information about IKM SQL to Hive-HBase-File (SQOOP), see IKM SQL to Hive-HBase-File (SQOOP) [Deprecated].
To load data from an SQL Database into a Hive target:
For more information about LKM SQL to Hive SQOOP, see LKM SQL to Hive SQOOP.
To load data from an SQL Database into a File target:
For more information about IKM SQL to Hive-HBase-File (SQOOP), see IKM SQL to Hive-HBase-File (SQOOP) [Deprecated].
To load data from an SQL Database into an HBase target:
For more information about LKM SQL to HBase SQOOP Direct, see LKM SQL to HBase SQOOP Direct.
After loading data into Hive, you can validate and transform the data using the following knowledge modules.
IKM Hive Control Append
For more information, see IKM Hive Control Append.
IKM Hive Append
For more information, see IKM Hive Append.
IKM Hive Incremental Update
For more information, see IKM Hive Incremental Update.
CKM Hive
For more information, see CKM Hive (Deprecated).
IKM Hive Transform
For more information, see IKM Hive Transform (Deprecated).
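All of these modules ultimately generate and run HiveQL against the target tables. For illustration only (this is not the exact code any KM generates), here is a minimal sketch of an append-style load with a simple validation filter, issued over JDBC; the table and column names are hypothetical:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HiveAppendExample {
    public static void main(String[] args) throws Exception {
        Class.forName("weblogic.jdbc.hive.HiveDriver");
        String url = "jdbc:weblogic:hive://localhost:10000;DatabaseName=default";
        try (Connection conn = DriverManager.getConnection(url, "default", "default");
             Statement stmt = conn.createStatement()) {
            // Append transformed rows from a staging table into the target,
            // filtering out rows that fail a simple validation rule.
            stmt.execute(
                "INSERT INTO TABLE sales_target " +
                "SELECT id, UPPER(region), amount FROM sales_staging " +
                "WHERE amount IS NOT NULL");
        }
    }
}
```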
Use the knowledge modules listed in the following table to load data from an HDFS file or Hive source into an Oracle database target using Oracle Loader for Hadoop.
Table 4-2 Knowledge Modules to load data into Oracle Database
| Knowledge Module | Use To... |
|---|---|
| IKM File-Hive to Oracle (OLH-OSCH) | Load data from an HDFS file or Hive source into an Oracle database target using Oracle Loader for Hadoop. For more information, see IKM File-Hive to Oracle (OLH-OSCH) [Deprecated]. |
| LKM File to Oracle OLH-OSCH | Load data from an HDFS file into an Oracle staging table using Oracle Loader for Hadoop. For more information, see LKM File to Oracle OLH-OSCH. |
| LKM File to Oracle OLH-OSCH Direct | Load data from an HDFS file into an Oracle database target using Oracle Loader for Hadoop. For more information, see LKM File to Oracle OLH-OSCH Direct. |
| LKM Hive to Oracle OLH-OSCH | Load data from a Hive source into an Oracle staging table using Oracle Loader for Hadoop. For more information, see LKM Hive to Oracle OLH-OSCH. |
| LKM Hive to Oracle OLH-OSCH Direct | Load data from a Hive source into an Oracle database target using Oracle Loader for Hadoop. For more information, see LKM Hive to Oracle OLH-OSCH Direct. |
Use the knowledge modules listed in the following table to load data from an HDFS file, HBase source, or Hive source into an SQL database target using SQOOP.
Table 4-3 Knowledge Modules to load data into SQL Database
| Knowledge Module | Use To... |
|---|---|
| IKM File-Hive to SQL (SQOOP) | Load data from an HDFS file or Hive source into an SQL database target using SQOOP. For more information, see IKM File-Hive to SQL (SQOOP) [Deprecated]. |
| LKM HBase to SQL SQOOP | Load data from an HBase source into an SQL database target using SQOOP. For more information, see LKM HBase to SQL SQOOP. |
| LKM File to SQL SQOOP | Load data from an HDFS file into an SQL database target using SQOOP. For more information, see LKM File to SQL SQOOP. |
| LKM Hive to SQL SQOOP | Load data from a Hive source into an SQL database target using SQOOP. For more information, see LKM Hive to SQL SQOOP. |
Loading Data from Kafka to Spark