1 Big Data Integration with Oracle Data Integrator

This chapter provides an overview of Big Data integration using Oracle Data Integrator. It also provides a compatibility matrix of the supported Big Data technologies.

This chapter includes the following sections:

Section 1.1, "Overview of Hadoop Data Integration"
Section 1.2, "Big Data Knowledge Modules Matrix"

1.1 Overview of Hadoop Data Integration

Apache Hadoop is designed to handle and process data that typically comes from non-relational sources, in volumes beyond what relational databases can handle.

With Oracle Data Integrator, you design the 'what' of an integration flow and assign knowledge modules to define the 'how' of the flow. The 'how' is the technology that carries out the flow, such as Oracle, Teradata, Hive, Spark, or Pig, and the range of available mechanisms is extensible.
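The separation of 'what' and 'how' can be pictured with a small Python sketch. This is an illustration only: it is not the Oracle Data Integrator KM API, and it does not show the code that the shipped KMs generate. The names `LogicalMapping`, `render_hiveql`, `render_pig_latin`, and `generate`, and the table names used in the usage note, are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogicalMapping:
    """The 'what': copy selected columns from a source to a target (hypothetical model)."""
    source: str
    target: str
    columns: tuple


def render_hiveql(m: LogicalMapping) -> str:
    """One possible 'how': express the mapping as a HiveQL statement."""
    cols = ", ".join(m.columns)
    return f"INSERT INTO TABLE {m.target} SELECT {cols} FROM {m.source}"


def render_pig_latin(m: LogicalMapping) -> str:
    """Another possible 'how': express the same mapping as a Pig Latin script."""
    cols = ", ".join(m.columns)
    return (
        f"src = LOAD '{m.source}' USING org.apache.hive.hcatalog.pig.HCatLoader();\n"
        f"out = FOREACH src GENERATE {cols};\n"
        f"STORE out INTO '{m.target}' USING org.apache.hive.hcatalog.pig.HCatStorer();"
    )


# The renderer registry stands in for the idea of assigning a KM to a mapping.
RENDERERS = {"Hive": render_hiveql, "Pig": render_pig_latin}


def generate(mapping: LogicalMapping, technology: str) -> str:
    """Render the unchanged logical mapping with the chosen technology."""
    return RENDERERS[technology](mapping)
```

For example, `generate(LogicalMapping("src_orders", "tgt_orders", ("id", "amount")), "Hive")` and the same call with `"Pig"` describe the same flow in two languages without any change to the mapping itself.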

Employing familiar and easy-to-use tools and pre-configured knowledge modules (KMs), Oracle Data Integrator lets you do the following:

Load data into Hadoop directly from Files or SQL databases.
For more information, see Section 4.1, "Integrating Hadoop Data".

Validate and transform data within Hadoop, with the ability to make the data available in various forms such as Hive, HBase, or HDFS.
For more information, see Section 4.15, "Validating and Transforming Data Within Hive".

Load the processed data from Hadoop into an Oracle database, a SQL database, or Files.
For more information, see Section 4.1, "Integrating Hadoop Data".

Execute integration projects as Oozie workflows on Hadoop.
For more information, see Section 5.1, "Executing Oozie Workflows with Oracle Data Integrator".

Audit Oozie workflow execution logs from within Oracle Data Integrator.
For more information, see Section 5.5, "Auditing Hadoop Logs".

Generate code in different languages for Hadoop, such as HiveQL, Pig Latin, or Spark Python.
For more information, see Section 6.8, "Generating Code in Different Languages".

1.2 Big Data Knowledge Modules Matrix

Depending on the source and target technologies, you can use the KMs shown in the following table in your integration projects. You can also use a combination of these KMs. For example, to read data from SQL into Spark, you can first load the data into HDFS using LKM SQL to File SQOOP Direct, and then use LKM File to Spark to continue.

The following table shows the Big Data KMs that Oracle Data Integrator provides to integrate data between different source and target technologies.

Table 1-1 Big Data Knowledge Modules

Source	Target	Knowledge Module
OS File	HDFS File	-
	Hive	LKM File to Hive LOAD DATA Direct
	HBase	-
	Pig	LKM File to Pig
	Spark	LKM File to Spark
Generic SQL	HDFS File	LKM SQL to File SQOOP Direct
	Hive	LKM SQL to Hive SQOOP
	HBase	LKM SQL to HBase SQOOP Direct
	Pig	-
	Spark	-
HDFS File	OS File	-
	Generic SQL	LKM File to SQL SQOOP
	Oracle SQL	LKM File to Oracle OLH-OSCH Direct
	HDFS File	-
	Hive	LKM File to Hive LOAD DATA Direct
	HBase	-
	Pig	LKM File to Pig
	Spark	LKM File to Spark
Hive	OS File	LKM Hive to File Direct
	Generic SQL	LKM Hive to SQL SQOOP
	Oracle SQL	LKM Hive to Oracle OLH-OSCH Direct
	HDFS File	LKM Hive to File Direct
	Hive	IKM Hive Append
	HBase	LKM Hive to HBase Incremental Update HBASE-SERDE Direct
	Pig	LKM Hive to Pig
	Spark	LKM Hive to Spark
HBase	OS File	-
	Generic SQL	LKM HBase to SQL SQOOP
	Oracle SQL	-
	HDFS File	-
	Hive	LKM HBase to Hive HBASE-SERDE
	HBase	-
	Pig	LKM HBase to Pig
	Spark	-
Pig	OS File	LKM Pig to File
	Generic SQL	LKM SQL to Pig SQOOP
	Oracle SQL	-
	HDFS File	LKM Pig to File
	Hive	LKM Pig to Hive
	HBase	LKM Pig to HBase
	Pig	-
	Spark	-
Spark	OS File	LKM Spark to File
	Generic SQL	-
	Oracle SQL	-
	HDFS File	LKM Spark to File
	Hive	LKM Spark to Hive
	HBase	-
	Pig	-
	Spark	-
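
The idea of combining KMs when no direct KM exists can be sketched as a lookup over the table. The following Python example is a minimal illustration, not part of Oracle Data Integrator: it copies a subset of Table 1-1 into a dictionary and searches for the shortest chain of KMs between two technologies. The names `KM_MATRIX` and `find_km_chain` are hypothetical, and the sketch assumes that any intermediate technology is an acceptable staging point, which may not hold in a real project.

```python
from collections import deque

# Subset of Table 1-1, keyed by (source, target).
# Combinations shown as "-" in the table are omitted.
KM_MATRIX = {
    ("OS File", "Hive"): "LKM File to Hive LOAD DATA Direct",
    ("OS File", "Pig"): "LKM File to Pig",
    ("OS File", "Spark"): "LKM File to Spark",
    ("Generic SQL", "HDFS File"): "LKM SQL to File SQOOP Direct",
    ("Generic SQL", "Hive"): "LKM SQL to Hive SQOOP",
    ("Generic SQL", "HBase"): "LKM SQL to HBase SQOOP Direct",
    ("HDFS File", "Generic SQL"): "LKM File to SQL SQOOP",
    ("HDFS File", "Oracle SQL"): "LKM File to Oracle OLH-OSCH Direct",
    ("HDFS File", "Hive"): "LKM File to Hive LOAD DATA Direct",
    ("HDFS File", "Pig"): "LKM File to Pig",
    ("HDFS File", "Spark"): "LKM File to Spark",
    ("Hive", "Hive"): "IKM Hive Append",
    ("Hive", "Spark"): "LKM Hive to Spark",
}


def find_km_chain(source, target):
    """Return the shortest list of KMs leading from source to target, or None."""
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        tech, chain = queue.popleft()
        for (src, tgt), km in KM_MATRIX.items():
            if src != tech:
                continue
            if tgt == target:
                # Checked before the 'seen' test so same-technology KMs are found.
                return chain + [km]
            if tgt in seen:
                continue
            seen.add(tgt)
            queue.append((tgt, chain + [km]))
    return None


print(find_km_chain("Generic SQL", "Spark"))
```

With this subset, the call for Generic SQL to Spark returns the two-step chain through HDFS described at the start of this section. A technology with no outgoing entries in the dictionary, such as Oracle SQL, yields `None`.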