This chapter describes how to set up the query processing engines that are supported by Oracle Data Integrator to generate code in different languages.
Hadoop provides a framework for parallel data processing in a cluster. Several languages provide a user front-end to this framework. Oracle Data Integrator supports the following query processing engines to generate code in different languages:
Hive
The Apache Hive warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL.
Pig
Pig is a high-level platform for creating MapReduce programs used with Hadoop. The language for this platform is called Pig Latin.
Spark
Spark is a fast and general processing engine compatible with Hadoop data. It can run in Hadoop clusters through YARN or Spark's standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop Input Format.
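As an illustration of what these languages look like, the following sketch expresses the same filter in HiveQL, Pig Latin, and Spark's Python DataFrame style. The dataset, file, and column names here are invented for the example:

```python
# Illustrative sketch only: the same filter over a hypothetical "orders"
# dataset, expressed in each engine's language.

hiveql = "SELECT name, amount FROM orders WHERE amount > 100"

pig_latin = "\n".join([
    "orders = LOAD 'orders.csv' USING PigStorage(',') AS (name:chararray, amount:int);",
    "large = FILTER orders BY amount > 100;",
    "DUMP large;",
])

# PySpark DataFrame API equivalent (df would be a DataFrame over the same data).
pyspark = "df.filter(df.amount > 100).select('name', 'amount')"

for label, code in (("HiveQL", hiveql), ("Pig Latin", pig_latin), ("Spark", pyspark)):
    print(f"--- {label} ---\n{code}")
```

All three express the same logical operation; which one Oracle Data Integrator generates depends on the staging data server chosen for the mapping.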
To generate code in these languages, you must set up Hive, Pig, and Spark data servers in Oracle Data Integrator. These data servers are then used as the staging area in your mappings to generate HiveQL, Pig Latin, or Spark code.
For more information, see Section 2.2, "Generate Code in Different Languages with Oracle Data Integrator".
To set up the Hive data server:
Click the Topology tab.
In the Physical Architecture tree, under Technologies, right-click Hive and then click New Data Server.
In the Definition tab, specify the details of the Hive data server.
See Section 6.2.1, "Hive Data Server Definition" for more information.
In the JDBC tab, specify the Hive data server connection details.
See Section 6.2.2, "Hive Data Server Connection Details" for more information.
Click Test Connection to test the connection to the Hive data server.
The following table describes the fields that you need to specify on the Definition tab when creating a new Hive data server.
Note: Only the fields required or specific for defining a Hive data server are described.
Table 6-1 Hive Data Server Definition

| Field | Description |
|---|---|
| Name | Name of the data server that appears in Oracle Data Integrator. |
| Data Server | Physical name of the data server. |
| User/Password | Hive user with its password. |
| Metastore URI | Hive Metastore URIs: for example, `thrift://<host>:<port>`. |
| Hadoop Data Server | Hadoop data server that you want to associate with the Hive data server. |
| Additional Classpath | Additional classpaths. |
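Hive Metastore URIs use the Thrift scheme (`thrift://host:port`). As a hedged sketch, a quick sanity check of such a URI before entering it in the Definition tab might look like the following; the host name and port here are placeholders, not a real server:

```python
from urllib.parse import urlparse

def check_metastore_uri(uri: str) -> bool:
    """Return True if uri looks like a Hive Metastore URI (thrift://host:port)."""
    parsed = urlparse(uri)
    return parsed.scheme == "thrift" and bool(parsed.hostname) and parsed.port is not None

print(check_metastore_uri("thrift://metastore.example.com:9083"))  # True
print(check_metastore_uri("http://metastore.example.com:9083"))    # wrong scheme: False
print(check_metastore_uri("thrift://metastore.example.com"))       # no port: False
```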
The following table describes the fields that you need to specify on the JDBC tab when creating a new Hive data server.
Note: Only the fields required or specific for defining a Hive data server are described.
Table 6-2 Hive Data Server Connection Details

| Field | Description |
|---|---|
| JDBC Driver | JDBC driver used to connect to the Hive data server; see the driver documentation for connection details. |
| JDBC URL | JDBC URL of the Hive data server: for example, `jdbc:hive2://<host>:<port>/<database>`. |
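HiveServer2 JDBC URLs follow the pattern `jdbc:hive2://<host>:<port>/<database>`. As a hedged sketch (the helper name and host are invented; 10000 is the usual HiveServer2 port), such a URL can be assembled like this:

```python
def hive2_jdbc_url(host: str, port: int = 10000, database: str = "default") -> str:
    """Assemble a HiveServer2 JDBC URL of the form jdbc:hive2://host:port/database."""
    return f"jdbc:hive2://{host}:{port}/{database}"

# Placeholder host name, not a real server.
print(hive2_jdbc_url("hive.example.com"))  # jdbc:hive2://hive.example.com:10000/default
```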
Create a Hive physical schema using the standard procedure, as described in Creating a Physical Schema in Administering Oracle Data Integrator.
Create a logical schema for this physical schema using the standard procedure, as described in Creating a Logical Schema in Administering Oracle Data Integrator, and associate it in a given context.
To set up the Pig data server:
Click the Topology tab.
In the Physical Architecture tree, under Technologies, right-click Pig and then click New Data Server.
In the Definition tab, specify the details of the Pig data server.
See Section 6.4.1, "Pig Data Server Definition" for more information.
In the Properties tab, add the Pig data server properties.
See Section 6.4.2, "Pig Data Server Properties" for more information.
Click Test Connection to test the connection to the Pig data server.
The following table describes the fields that you need to specify on the Definition tab when creating a new Pig data server.
Note: Only the fields required or specific for defining a Pig data server are described.
Table 6-3 Pig Data Server Definition

| Field | Description |
|---|---|
| Name | Name of the data server that appears in Oracle Data Integrator. |
| Data Server | Physical name of the data server. |
| Process Type | Choose one of the following: Local Mode or MapReduce Mode. |
| Hadoop Data Server | Hadoop data server that you want to associate with the Pig data server. Note: This field is displayed only when Process Type is set to MapReduce Mode. |
| Additional Classpath | Specify additional classpaths. For pig-hcatalog-hive, an additional classpath is required beyond those for the other process types. |
| User/Password | Pig user with its password. |
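The Process Type field corresponds to Pig's execution type: Local Mode runs against the local file system, while MapReduce Mode runs against the Hadoop cluster (on the Pig command line, `pig -x local` versus `pig -x mapreduce`). A hedged sketch of that mapping, with an invented helper name and script name:

```python
def pig_command(process_type: str, script: str) -> list:
    """Map the Process Type field to Pig's -x execution-type flag."""
    exec_types = {"Local Mode": "local", "MapReduce Mode": "mapreduce"}
    if process_type not in exec_types:
        raise ValueError(f"unknown process type: {process_type}")
    return ["pig", "-x", exec_types[process_type], script]

print(pig_command("MapReduce Mode", "transform.pig"))
# ['pig', '-x', 'mapreduce', 'transform.pig']
```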
The following table describes the Pig data server properties that you need to add on the Properties tab when creating a new Pig data server.
Create a Pig physical schema using the standard procedure, as described in Creating a Physical Schema in Administering Oracle Data Integrator.
Create a logical schema for this physical schema using the standard procedure, as described in Creating a Logical Schema in Administering Oracle Data Integrator, and associate it in a given context.
To set up the Spark data server:
Click the Topology tab.
In the Physical Architecture tree, under Technologies, right-click Spark Python and then click New Data Server.
In the Definition tab, specify the details of the Spark data server.
See Section 6.6.1, "Spark Data Server Definition" for more information.
Click Test Connection to test the connection to the Spark data server.
The following table describes the fields that you need to specify on the Definition tab when creating a new Spark Python data server.
Note: Only the fields required or specific for defining a Spark Python data server are described.
Create a Spark physical schema using the standard procedure, as described in Creating a Physical Schema in Administering Oracle Data Integrator.
Create a logical schema for this physical schema using the standard procedure, as described in Creating a Logical Schema in Administering Oracle Data Integrator, and associate it in a given context.
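As noted in the overview, Spark applications can run on a Hadoop cluster through YARN or in Spark's standalone mode; on the command line this is the `--master` option of `spark-submit`. A hedged illustration of that choice (the builder function and script name are invented for this sketch):

```python
def spark_submit_command(script: str, master: str = "yarn") -> list:
    """Build a spark-submit invocation; master may be 'yarn', 'local[*]',
    or a standalone 'spark://host:port' URL."""
    return ["spark-submit", "--master", master, script]

print(spark_submit_command("mapping_job.py"))
# ['spark-submit', '--master', 'yarn', 'mapping_job.py']
```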
By default, Oracle Data Integrator generates HiveQL code. To generate Pig Latin or Spark code, you must use the Pig data server or the Spark data server as the staging location for your mapping.
Before you generate code in these languages, ensure that the Hive, Pig, and Spark data servers are set up.
For more information, see the following sections:
Section 6.2, "Setting Up Hive Data Server"
Section 6.4, "Setting Up Pig Data Server"
Section 6.6, "Setting Up Spark Data Server"
To generate code in different languages:
Open your mapping.
To generate HiveQL code, run the mapping with the default staging location (Hive).
To generate Pig Latin or Spark code, go to the Physical diagram and do one of the following:
To generate Pig Latin code, set the Execute On Hint option to use the Pig data server as the staging location for your mapping.
To generate Spark code, set the Execute On Hint option to use the Spark data server as the staging location for your mapping.
Execute the mapping.
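The rule above reduces to a simple lookup: the technology of the staging data server determines the generated language, with Hive (and therefore HiveQL) as the default. This sketch simply encodes that rule; the names are illustrative:

```python
# Generated language keyed by the staging data server's technology.
GENERATED_LANGUAGE = {
    "Hive": "HiveQL",         # the default staging location
    "Pig": "Pig Latin",
    "Spark Python": "Spark",
}

def language_for_staging(technology: str) -> str:
    """Return the language ODI generates for a given staging technology."""
    return GENERATED_LANGUAGE.get(technology, "HiveQL")  # Hive is the default

print(language_for_staging("Pig"))  # Pig Latin
```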
For more information, see Section 6.1, "Query Processing Engines Supported by Oracle Data Integrator".