This chapter explains the Metadata Provider functionality, different types of Metadata Providers, and examples that can be used to understand the functionality.
Topics:
The Metadata Provider is valid only if handlers are configured to run with a Replicat process.
The Replicat process provides functionality to perform source table to target table and source column to target column mapping using syntax in the Replicat configuration file. The source metadata definitions are included in the Oracle GoldenGate trail file (or are provided by source definitions files in Oracle GoldenGate releases prior to 12.2). When the replication target is a database, the Replicat process obtains the target metadata definitions from the target database. However, this approach breaks down when pushing data to Big Data applications, or to Java Delivery in general, because Big Data applications generally provide no target metadata, so Replicat mapping is not possible. The Metadata Provider exists to address this deficiency. The Metadata Provider can be used to define target metadata using Avro, Hive, or a JDBC connection, which in turn enables source table to target table and source column to target column Replicat mapping.
The use of the Metadata Provider is optional; it is enabled if the gg.mdp.type property is specified in the Java Adapter properties file. If the metadata included in the source Oracle GoldenGate trail file is acceptable for the output, then do not use the Metadata Provider. The Metadata Provider should be used in the following cases:
The requirement is for mapping source table names into target table names that do not match.
The requirement is for mapping of source column names into target column names that do not match.
The requirement is for the inclusion of certain columns from the source trail file and omitting other columns.
Replicat mapping has a general limitation in that the mapping defined in the Replicat configuration file is static. Oracle GoldenGate 12.2 and later provides functionality for DDL propagation when using an Oracle Database as the source. The proper handling of schema evolution can be problematic when the Metadata Provider and Replicat mapping are being used. You should consider your use cases for schema evolution and plan for how you want to update the Metadata Provider and the Replicat mapping syntax for required changes.
For every table mapped in Replicat using COLMAP, the metadata is retrieved from the configured Metadata Provider, and the retrieved metadata is then used by Replicat for the column mapping functionality.

You can configure only one Metadata Provider implementation at a time. Currently, the Avro, Hive, and JDBC Metadata Providers are supported.
Scenarios - When to use Metadata Provider
The following scenarios do not require the Metadata Provider to be configured:

The mapping of a schema name whereby the source schema named GG is mapped to the target schema named GGADP.

The mapping of a schema and table name whereby GG.TCUSTMER is mapped to GGADP.TCUSTMER_NEW.

MAP GG.*, TARGET GGADP.*;
(OR)
MAP GG.TCUSTMER, TARGET GGADP.TCUSTMER_NEW;
The following scenario requires the Metadata Provider to be configured:

The mapping of column names whereby the source column name does not match the target column name. For example, source column CUST_CODE mapped to target column CUST_CODE_NEW.

MAP GG.TCUSTMER, TARGET GG_ADP.TCUSTMER_NEW, COLMAP(USEDEFAULTS, CUST_CODE_NEW=CUST_CODE, CITY2=CITY);
The Avro Metadata Provider is used to retrieve the table metadata from Avro schema files. For every table mapped in Replicat using COLMAP, the metadata is retrieved from the Avro schema, and the retrieved metadata is then used by Replicat for column mapping.
This section contains the following:
The Avro Metadata Provider uses Avro schema definition files to retrieve metadata. The Avro schemas are defined using JSON. For each table mapped in the process_name.prm file, a corresponding Avro schema definition file should be created. More information on defining Avro schemas is found at:

http://avro.apache.org/docs/current/gettingstartedjava.html#Defining+a+schema
Avro Metadata Provider Schema definition syntax:
```
{
  "namespace": "[$catalogname.]$schemaname",
  "type": "record",
  "name": "$tablename",
  "fields": [
    {"name": "$col1", "type": "$datatype"},
    {"name": "$col2", "type": "$datatype", "primary_key": true},
    {"name": "$col3", "type": "$datatype", "primary_key": true},
    {"name": "$col4", "type": ["$datatype", "null"]}
  ]
}
```

namespace - name of the catalog/schema being mapped
name - name of the table being mapped
fields.name - array of column names
fields.type - data type of the column
fields.primary_key - indicates the column is part of the primary key

Representing nullable and not nullable columns:

"type": "$datatype" - indicates the column is not nullable, where "$datatype" is the actual data type.
"type": ["$datatype", "null"] - indicates the column is nullable, where "$datatype" is the actual data type.
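As a sketch of how these definitions are consumed, the following Python snippet (illustrative only, not part of the product) parses an Avro Metadata Provider schema definition and extracts the table identity, column data types, nullability, and primary key columns:

```python
import json

def parse_mdp_schema(schema_json):
    """Extract table identity, columns, nullability, and primary keys
    from an Avro Metadata Provider schema definition."""
    schema = json.loads(schema_json)
    columns = []
    for field in schema["fields"]:
        ftype = field["type"]
        # A union type containing "null" marks the column as nullable.
        nullable = isinstance(ftype, list) and "null" in ftype
        datatype = [t for t in ftype if t != "null"][0] if nullable else ftype
        columns.append({
            "name": field["name"].strip(),
            "type": datatype,
            "nullable": nullable,
            "primary_key": field.get("primary_key", False),
        })
    return {"schema": schema["namespace"], "table": schema["name"], "columns": columns}

example = '''
{"namespace": "GG_AVRO", "type": "record", "name": "TCUSTMER_AVRO",
 "fields": [
   {"name": "CUST_CODE2", "type": "string", "primary_key": true},
   {"name": "STATE", "type": ["string", "null"]}
 ]}
'''
parsed = parse_mdp_schema(example)
print(parsed["table"])                                              # TCUSTMER_AVRO
print([c["name"] for c in parsed["columns"] if c["primary_key"]])   # ['CUST_CODE2']
```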
The file naming convention for Avro schema files accessed by the Avro Metadata Provider must be in the following format:
```
[$catalogname.]$schemaname.$tablename.mdp.avsc
```

$catalogname - name of the catalog, if one exists
$schemaname - name of the schema
$tablename - name of the table
.mdp.avsc - a constant suffix that must always be appended
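The naming convention can be expressed as a small helper (a hypothetical utility, not part of the product):

```python
def mdp_schema_filename(schemaname, tablename, catalogname=None):
    """Build the Avro schema file name the Avro Metadata Provider expects:
    [$catalogname.]$schemaname.$tablename.mdp.avsc"""
    parts = [catalogname] if catalogname else []
    parts += [schemaname, tablename]
    return ".".join(parts) + ".mdp.avsc"

print(mdp_schema_filename("GG_AVRO", "TCUSTMER_AVRO"))
# GG_AVRO.TCUSTMER_AVRO.mdp.avsc
```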
Supported Avro Data Types:
boolean
bytes
double
float
int
long
string
For more information on Avro data types, see https://avro.apache.org/docs/1.7.5/spec.html#schema_primitive.
The Avro schema definitions should be created for all tables mapped in Replicat's parameter file before starting the Replicat process.
There is no additional classpath setting required for the Avro Metadata Provider.
The configuration properties of Oracle GoldenGate Avro Metadata Provider are detailed in this section.
Property | Required/Optional | Legal Values | Default | Explanation |
---|---|---|---|---|
gg.mdp.type | Required | avro | None | Selects the Avro Metadata Provider. |
gg.mdp.schemaFilesPath | Required | A legal directory path; for example, /home/user/ggadp/avroschema/ | None | Path to the Avro schema files directory. |
gg.mdp.charset | Optional | Valid character set | UTF-8 | Specifies the character set of columns with a character data type. This is used to convert the source data from the trail file to the correct target character set. |
gg.mdp.nationalCharset | Optional | Valid character set | UTF-8 | Specifies the character set of columns with a national character data type. This is used to convert the source data from the trail file to the correct target character set. For example, this indicates the character set of columns such as NCHAR and NVARCHAR2 in an Oracle database. |
This section provides an example for configuring the Avro Metadata Provider. Consider a source with the following table:

```
TABLE GG.TCUSTMER {
    CUST_CODE VARCHAR(4) PRIMARY KEY,
    NAME VARCHAR(100),
    CITY VARCHAR(200),
    STATE VARCHAR(200)
}
```
This example maps column CUST_CODE (GG.TCUSTMER) in the source to CUST_CODE2 (GG_AVRO.TCUSTMER_AVRO) on the target, and column CITY (GG.TCUSTMER) in the source to CITY2 (GG_AVRO.TCUSTMER_AVRO) on the target. Thus, the mapping in the process_name.prm file is:
MAP GG.TCUSTMER, TARGET GG_AVRO.TCUSTMER_AVRO, COLMAP(USEDEFAULTS, CUST_CODE2=CUST_CODE, CITY2=CITY);
Mapping definition in this example:

Source schema GG is mapped to target schema GG_AVRO.

Source column CUST_CODE is mapped to target column CUST_CODE2.

Source column CITY is mapped to target column CITY2.

USEDEFAULTS specifies that the rest of the column names are the same on both source and target (the NAME and STATE columns).
The Avro schema definition file for the preceding example:
File path: /home/ggadp/avromdp/GG_AVRO.TCUSTMER_AVRO.mdp.avsc
```
{
  "namespace": "GG_AVRO",
  "type": "record",
  "name": "TCUSTMER_AVRO",
  "fields": [
    {"name": "NAME", "type": "string"},
    {"name": "CUST_CODE2", "type": "string", "primary_key": true},
    {"name": "CITY2", "type": "string"},
    {"name": "STATE", "type": ["string", "null"]}
  ]
}
```
The configuration in the Java Adapter properties file includes the following:
```
gg.mdp.type=avro
gg.mdp.schemaFilesPath=/home/ggadp/avromdp
```
Following is sample output using the delimited text formatter, with a semi-colon as the delimiter, for the preceding example:
I;GG_AVRO.TCUSTMER_AVRO;2013-06-02 22:14:36.000000;NAME;BG SOFTWARE CO;CUST_CODE2;WILL;CITY2;SEATTLE;STATE;WA
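A quick way to sanity-check such output is to split it back into its parts. The following sketch assumes the field layout shown above (operation type, table, timestamp, then column name/value pairs); the actual layout depends on the formatter configuration:

```python
def parse_delimited_row(line, delimiter=";"):
    """Split a delimited-text formatter row into its operation type,
    table name, timestamp, and column name/value pairs."""
    fields = line.split(delimiter)
    op, table, ts = fields[0], fields[1], fields[2]
    # Remaining fields alternate: column name, column value.
    cols = dict(zip(fields[3::2], fields[4::2]))
    return op, table, ts, cols

row = "I;GG_AVRO.TCUSTMER_AVRO;2013-06-02 22:14:36.000000;NAME;BG SOFTWARE CO;CUST_CODE2;WILL;CITY2;SEATTLE;STATE;WA"
op, table, ts, cols = parse_delimited_row(row)
print(op, table)      # I GG_AVRO.TCUSTMER_AVRO
print(cols["CITY2"])  # SEATTLE
```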
The Oracle GoldenGate for Big Data installation includes a sample Replicat configuration file, a sample Java Adapter properties file, and sample Avro schemas at:

GoldenGate_install_directory/AdapterExamples/big-data/metadata_provider/avro
The Avro schema definitions and the mappings in the Replicat configuration file may need to be modified if there is a DDL change in the source database tables. You may want to abort or suspend the Replicat process in the case of a metadata change event. The Replicat process can be aborted by adding the following to the Replicat configuration file (process_name.prm):
DDL INCLUDE ALL, EVENTACTIONS (ABORT)
Alternatively, the Replicat process can be suspended by adding the following to the Replicat configuration file:
DDL INCLUDE ALL, EVENTACTIONS (SUSPEND)
The Avro bytes data type cannot be used as a primary key.
The source to target mapping defined in the Replicat configuration file is static. Oracle GoldenGate 12.2 and later supports DDL propagation and source schema evolution for Oracle Databases as the replication source. However, evolution of the source schemas may be problematic with the static mapping configured in the Replicat configuration file.
This section contains the following:
The Avro schema files directory location specified by the gg.mdp.schemaFilesPath configuration property must be a valid directory. Failure to configure a valid directory in the gg.mdp.schemaFilesPath property leads to the following exception:
oracle.goldengate.util.ConfigException: Error initializing Avro metadata provider Specified schema location does not exist. {/path/to/schema/files/dir}
For every table mapped in the process_name.prm
file, a corresponding Avro schema file must be created in the directory specified in gg.mdp.schemaFilesPath
.
For example, consider the following scenario:
Mapping:
MAP GG.TCUSTMER, TARGET GG_AVRO.TCUSTMER_AVRO, COLMAP(USEDEFAULTS, cust_code2=cust_code, CITY2 = CITY);
Property:
gg.mdp.schemaFilesPath=/home/usr/avro/
A file called GG_AVRO.TCUSTMER_AVRO.mdp.avsc must be created in the /home/usr/avro/ directory; that is, /home/usr/avro/GG_AVRO.TCUSTMER_AVRO.mdp.avsc.
Failing to create the /home/usr/avro/GG_AVRO.TCUSTMER_AVRO.mdp.avsc
file results in the following exception:
java.io.FileNotFoundException: /home/usr/avro/GG_AVRO.TCUSTMER_AVRO.mdp.avsc
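Because every mapped table needs a matching schema file, a pre-flight check like the following hypothetical helper can catch missing files before the Replicat process is started (file-name format as described earlier in this chapter):

```python
import os
import tempfile

def missing_schema_files(mapped_tables, schema_files_path):
    """Return the expected .mdp.avsc file names that are absent from the
    gg.mdp.schemaFilesPath directory -- a pre-flight check before
    starting the Replicat process."""
    missing = []
    for qualified_name in mapped_tables:          # e.g. "GG_AVRO.TCUSTMER_AVRO"
        expected = qualified_name + ".mdp.avsc"
        if not os.path.isfile(os.path.join(schema_files_path, expected)):
            missing.append(expected)
    return missing

# Demonstration against a throwaway directory holding one schema file.
demo_dir = tempfile.mkdtemp()
open(os.path.join(demo_dir, "GG_AVRO.TCUSTMER_AVRO.mdp.avsc"), "w").close()
print(missing_schema_files(["GG_AVRO.TCUSTMER_AVRO", "GG_AVRO.TCUSTORD_AVRO"], demo_dir))
# ['GG_AVRO.TCUSTORD_AVRO.mdp.avsc']
```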
The target schema name specified in the Replicat mapping must be the same as the namespace in the Avro schema definition file.
For example, consider the following scenario:
Mapping:
MAP GG.TCUSTMER, TARGET GG_AVRO.TCUSTMER_AVRO, COLMAP(USEDEFAULTS, cust_code2 = cust_code, CITY2 = CITY);

Avro Schema Definition:

{ "namespace": "GG_AVRO", .. }
In this scenario, Replicat abends with the following exception if the target schema name specified in the Replicat mapping does not match the Avro schema namespace:
Unable to retrieve table matadata. Table : GG_AVRO.TCUSTMER_AVRO Mapped [catalogname.]schemaname (GG_AVRO) does not match with the schema namespace {schema namespace}
The target table name specified in the Replicat mapping must be the same as the name in the Avro schema definition file.
For example, consider the following scenario:
Mapping:
MAP GG.TCUSTMER, TARGET GG_AVRO.TCUSTMER_AVRO, COLMAP(USEDEFAULTS, cust_code2 = cust_code, CITY2 = CITY);
Avro Schema Definition:
{ "namespace": "GG_AVRO", "name": "TCUSTMER_AVRO", .. }
In this scenario, Replicat abends with the following exception if the target table name specified in the Replicat mapping does not match the Avro schema name:
Unable to retrieve table matadata. Table : GG_AVRO.TCUSTMER_AVRO Mapped table name (TCUSTMER_AVRO) does not match with the schema table name {table name}
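The namespace and name checks described above can be approximated offline with a sketch like this (illustrative only; it assumes the last dot-separated component of the target name is the table, so everything before it matches the [catalogname.]schemaname namespace format):

```python
import json

def check_avro_identity(target_name, schema_json):
    """Verify that a Replicat TARGET name matches the namespace and name
    in the Avro schema definition file; return a list of mismatches."""
    schema = json.loads(schema_json)
    mapped_schema, mapped_table = target_name.rsplit(".", 1)
    problems = []
    if schema.get("namespace") != mapped_schema:
        problems.append("namespace %r does not match mapped schema %r"
                        % (schema.get("namespace"), mapped_schema))
    if schema.get("name") != mapped_table:
        problems.append("name %r does not match mapped table %r"
                        % (schema.get("name"), mapped_table))
    return problems

good = check_avro_identity("GG_AVRO.TCUSTMER_AVRO",
                           '{"namespace": "GG_AVRO", "name": "TCUSTMER_AVRO"}')
bad = check_avro_identity("WRONG_SCHEMA.TCUSTMER_AVRO",
                          '{"namespace": "GG_AVRO", "name": "TCUSTMER_AVRO"}')
print(good)  # []
print(bad)   # one namespace mismatch reported
```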
The Java Database Connectivity (JDBC) Metadata Provider is used to retrieve the table metadata from any target database that supports a JDBC connection and has a database schema. The JDBC Metadata Provider should be the preferred metadata provider for any target database that is an RDBMS, although various non-RDBMS targets also provide a JDBC driver.
This section contains the following:
The JDBC Metadata Provider uses the JDBC Driver provided with your target database. The metadata is retrieved using the JDBC Driver for every target table mapped in the Replicat properties file. Replicat processes use the retrieved target metadata for the column mapping functionality.
You can enable this feature for the JDBC Handler by configuring REPERROR in your Replicat parameter file. In addition, you need to define the error codes specific to your RDBMS JDBC target in the JDBC Handler properties file as follows:

Table 10-1 JDBC REPERROR Codes
Property | Value | Required |
---|---|---|
gg.error.duplicateErrorCodes | Comma-separated integer values of error codes that mean duplicate errors | No |
gg.error.notFoundErrorCodes | Comma-separated integer values of error codes that mean not-found errors | No |
gg.error.deadlockErrorCodes | Comma-separated integer values of error codes that mean deadlock errors | No |
For example:
```
#ErrorCode
gg.error.duplicateErrorCodes=1062,1088,1092,1291,1330,1331,1332,1333
gg.error.notFoundErrorCodes=0
gg.error.deadlockErrorCodes=1213
```
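The way these comma-separated lists classify an error code can be sketched as follows (illustrative only; the actual JDBC Handler logic may differ):

```python
def classify_error(code, props):
    """Map a numeric JDBC error code to the category implied by the
    gg.error.* properties (comma-separated integer lists)."""
    categories = {
        "duplicate": props.get("gg.error.duplicateErrorCodes", ""),
        "notfound":  props.get("gg.error.notFoundErrorCodes", ""),
        "deadlock":  props.get("gg.error.deadlockErrorCodes", ""),
    }
    for name, csv in categories.items():
        codes = {int(c) for c in csv.split(",") if c.strip()}
        if code in codes:
            return name
    return "unclassified"

props = {
    "gg.error.duplicateErrorCodes": "1062,1088,1092,1291,1330,1331,1332,1333",
    "gg.error.notFoundErrorCodes": "0",
    "gg.error.deadlockErrorCodes": "1213",
}
print(classify_error(1062, props))  # duplicate
print(classify_error(1213, props))  # deadlock
```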
To understand how the various JDBC types are mapped to database-specific SQL types, review the specifics at:
https://docs.oracle.com/javase/6/docs/technotes/guides/jdbc/getstart/mapping.html#table1
The JDBC Java Driver location must be included in the class path of the handler using the gg.classpath
property.
For example, the configuration for a MySQL database could be:
gg.classpath= /path/to/jdbc/driver/jar/mysql-connector-java-5.1.39-bin.jar
The following are the configurable values for the JDBC Metadata Provider. These properties are located in the Java Adapter properties file (not in the Replicat properties file).
Table 10-2 JDBC Metadata Provider Properties
Properties | Required/Optional | Legal Values | Default | Explanation |
---|---|---|---|---|
gg.mdp.type | Required | jdbc | None | Entering jdbc selects the JDBC Metadata Provider. |
gg.mdp.ConnectionUrl | Required | A valid JDBC connection URL | None | The target database JDBC URL. |
gg.mdp.DriverClassName | Required | Java class name of the JDBC driver | None | The fully qualified Java class name of the JDBC driver. |
gg.mdp.UserName | Optional | A legal username string | None | The user name for the JDBC connection. Alternatively, you can provide the user name using the ConnectionUrl property. |
gg.mdp.Password | Optional | A legal password string | None | The password for the JDBC connection. Alternatively, you can provide the password using the ConnectionUrl property. |
This section provides examples for configuring the JDBC Metadata Provider.
Oracle Thin Driver configuration:

```
gg.mdp.type=jdbc
gg.mdp.ConnectionUrl=jdbc:oracle:thin:@myhost:1521:orcl
gg.mdp.DriverClassName=oracle.jdbc.driver.OracleDriver
gg.mdp.UserName=username
gg.mdp.Password=password
```

MySQL Driver configuration:

```
gg.mdp.type=jdbc
gg.mdp.ConnectionUrl=jdbc:mysql://localhost/databaseName?user=username&password=password
gg.mdp.DriverClassName=com.mysql.jdbc.Driver
gg.mdp.UserName=username
gg.mdp.Password=password
```
Oracle OCI Driver configuration:

```
gg.mdp.type=jdbc
gg.mdp.ConnectionUrl=jdbc:oracle:oci:@myhost:1521:orcl
gg.mdp.DriverClassName=oracle.jdbc.driver.OracleDriver
gg.mdp.UserName=username
gg.mdp.Password=password
```
Teradata Driver configuration:

```
gg.mdp.type=jdbc
gg.mdp.ConnectionUrl=jdbc:teradata://10.111.11.111/USER=username,PASSWORD=password
gg.mdp.DriverClassName=com.teradata.jdbc.TeraDriver
gg.mdp.UserName=username
gg.mdp.Password=password
```
The Hive Metadata Provider is used to retrieve the table metadata from a Hive metastore. The metadata is retrieved from Hive for every target table mapped in the Replicat properties file using the COLMAP parameter. The retrieved target metadata is used by Replicat for the column mapping functionality.
This section contains the following:
The Hive Metadata Provider uses both Hive JDBC and HCatalog interfaces to retrieve metadata from the Hive metastore. For each table mapped in the process_name.prm file, a corresponding table should be created in Hive.

The default Hive configuration starts an embedded, local metastore Derby database. Apache Derby is designed to be an embedded database and allows only a single connection. This single-connection limitation means that Derby cannot function as the Hive metastore when working with the Hive Metadata Provider. To overcome this, you must configure Hive with a remote metastore database. More information on configuring Hive with a remote metastore database can be found at:
Hive does not support primary key semantics, so the metadata retrieved from the Hive metastore does not include a primary key definition. Use the Replicat KEYCOLS parameter to define primary keys when you use the Hive Metadata Provider.

KEYCOLS

The Replicat mapping KEYCOLS parameter must be used to define primary keys in the target schema. The Oracle GoldenGate HBase Handler requires primary keys; therefore, setting primary keys in the target schema is required when Replicat mapping is employed with HBase as the target.

Additionally, the output of the Avro formatters includes an array field to hold the primary key column names. If Replicat mapping is employed with the Avro formatters, you should consider using KEYCOLS to identify the primary key columns.

An example of configuring KEYCOLS is described in Sample Configuration.
Supported Hive Data types:
BIGINT
BINARY
BOOLEAN
CHAR
DATE
DECIMAL
DOUBLE
FLOAT
INT
SMALLINT
STRING
TIMESTAMP
TINYINT
VARCHAR
For more information on Hive data types, see https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types.
A list of supported databases that can be used to configure a remote Hive metastore can be found at https://cwiki.apache.org/confluence/display/Hive/AdminManual+MetastoreAdmin#AdminManualMetastoreAdmin-SupportedBackendDatabasesforMetastore.
In the following example, a MySQL database is configured as the Hive metastore using the following properties in the ${HIVE_HOME}/conf/hive-site.xml
Hive configuration file:
Note:
The ConnectionURL and driver class used in this example are specific to the MySQL database. Change the values appropriately if a database other than MySQL is chosen.
```
<property>
   <name>javax.jdo.option.ConnectionURL</name>
   <value>jdbc:mysql://MYSQL_DB_IP:MYSQL_DB_PORT/DB_NAME?createDatabaseIfNotExist=false</value>
</property>
<property>
   <name>javax.jdo.option.ConnectionDriverName</name>
   <value>com.mysql.jdbc.Driver</value>
</property>
<property>
   <name>javax.jdo.option.ConnectionUserName</name>
   <value>MYSQL_CONNECTION_USERNAME</value>
</property>
<property>
   <name>javax.jdo.option.ConnectionPassword</name>
   <value>MYSQL_CONNECTION_PASSWORD</value>
</property>
```
The list of parameters to be configured in the hive-site.xml
file for a remote metastore can be found at https://cwiki.apache.org/confluence/display/Hive/AdminManual+MetastoreAdmin#AdminManualMetastoreAdmin-RemoteMetastoreDatabase
.
Note:

The MySQL JDBC connector JAR must be added to the Hive classpath (the HIVE_HOME/lib/ directory), and DB_NAME should be replaced by a valid database name created in MySQL.
Start the Hive server:

HIVE_HOME/bin/hiveserver2

Start the Hive remote metastore server:

HIVE_HOME/bin/hive --service metastore
You must configure two things in the gg.classpath configuration variable in order for the Hive Metadata Provider to connect to Hive and run: the hive-site.xml file, and the Hive and HDFS client JARs. The client JARs must match the version of Hive to which the Hive Metadata Provider is connecting.
Create a hive-site.xml file with the following properties:
<configuration>
<!-- Mandatory Property -->
<property>
<name>hive.metastore.uris</name>
<value>thrift://HIVE_SERVER_HOST_IP:9083</value>
</property>
<!-- Optional Property. Default value is 5 -->
<property>
<name>hive.metastore.connect.retries</name>
<value>3</value>
</property>
<!-- Optional Property. Default value is 1 -->
<property>
<name>hive.metastore.client.connect.retry.delay</name>
<value>10</value>
</property>
<!-- Optional Property. Default value is 600 seconds -->
<property>
<name>hive.metastore.client.socket.timeout</name>
<value>50</value>
</property>
</configuration>
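Before troubleshooting connection problems, it can help to verify that the hive-site.xml file at least carries the mandatory hive.metastore.uris property. The following sketch (illustrative only, not part of the product) parses the file and flags the property if it is missing:

```python
import xml.etree.ElementTree as ET

def hive_site_properties(xml_text):
    """Parse hive-site.xml content into a {name: value} dict and raise
    if the mandatory hive.metastore.uris property is missing."""
    root = ET.fromstring(xml_text)
    props = {p.findtext("name"): p.findtext("value") for p in root.iter("property")}
    if "hive.metastore.uris" not in props:
        raise ValueError("hive-site.xml is missing hive.metastore.uris")
    return props

xml_text = """<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://10.0.0.5:9083</value>
  </property>
  <property>
    <name>hive.metastore.connect.retries</name>
    <value>3</value>
  </property>
</configuration>"""
print(hive_site_properties(xml_text)["hive.metastore.uris"])
# thrift://10.0.0.5:9083
```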
Note:
For example, if the hive-site.xml file is created in the /home/user/oggadp/dirprm directory, then the gg.classpath entry is gg.classpath=/home/user/oggadp/dirprm/
The default location of the Hive and HDFS client jars are the following directories:
```
HIVE_HOME/hcatalog/share/hcatalog/*
HIVE_HOME/lib/*
HIVE_HOME/hcatalog/share/webhcat/java-client/*
HADOOP_HOME/share/hadoop/common/*
HADOOP_HOME/share/hadoop/common/lib/*
HADOOP_HOME/share/hadoop/mapreduce/*
```
Configure the gg.classpath exactly as shown in the preceding step. The path to the hive-site.xml file must be the directory path only, with no wildcard appended; including the * wildcard in the path to the hive-site.xml file causes it not to be picked up. Conversely, the path to each dependency JAR directory must include the * wildcard character so that all of the JAR files in that directory are included in the associated classpath. Do not use *.jar.
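These wildcard rules can be captured in a small helper that assembles a gg.classpath value (a hypothetical convenience, not part of the product; assumes a Unix-style ':' path separator):

```python
def build_gg_classpath(config_dir, jar_dirs):
    """Assemble a gg.classpath value following the rules above: the
    hive-site.xml directory is added WITHOUT a wildcard, while each JAR
    directory gets a trailing /* (never *.jar)."""
    entries = [config_dir.rstrip("/") + "/"]
    entries += [d.rstrip("/") + "/*" for d in jar_dirs]
    return ":".join(entries)

cp = build_gg_classpath(
    "/home/user/oggadp/dirprm",
    ["/opt/hive/lib", "/opt/hadoop/share/hadoop/common"],
)
print(cp)
# /home/user/oggadp/dirprm/:/opt/hive/lib/*:/opt/hadoop/share/hadoop/common/*
```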
The configuration properties of the Hive Metadata Provider are detailed in this section.
Property | Required/Optional | Legal Values | Default | Explanation |
---|---|---|---|---|
gg.mdp.type | Required | hive | None | Selects the Hive Metadata Provider. |
gg.mdp.connectionUrl | Required | Format without Kerberos authentication: jdbc:hive2://HIVE_SERVER_IP:HIVE_JDBC_PORT/HIVE_DB Format with Kerberos authentication: jdbc:hive2://HIVE_SERVER_IP:HIVE_JDBC_PORT/HIVE_DB;principal=user/FQDN@MY.REALM | None | JDBC connection URL of the Hive server. |
gg.mdp.driverClassName | Required | org.apache.hive.jdbc.HiveDriver | None | Fully qualified Hive JDBC driver class name. |
gg.mdp.userName | Optional | Valid username | "" | User name to connect to the Hive database. The userName property is not required when Kerberos authentication is used. |
gg.mdp.password | Optional | Valid password | "" | Password to connect to the Hive database. |
gg.mdp.charset | Optional | Valid character set | UTF-8 | Specifies the character set of columns with a character data type. This is used to convert the source data from the trail file to the correct target character set. |
gg.mdp.nationalCharset | Optional | Valid character set | UTF-8 | Specifies the character set of columns with a national character data type. This is used to convert the source data from the trail file to the correct target character set. For example, this indicates the character set of columns such as NCHAR and NVARCHAR2 in an Oracle database. |
gg.mdp.authType | Optional | kerberos | none | Specifies the authentication type. |
gg.mdp.kerberosKeytabFile | Optional (Required if authType=kerberos) | Relative or absolute path to a Kerberos keytab file. | None | The keytab file allows access to a password to perform a kinit operation for Kerberos security. |
gg.mdp.kerberosPrincipal | Optional (Required if authType=kerberos) | A legal Kerberos principal name (user/FQDN@MY.REALM) | None | The Kerberos principal name for Kerberos authentication. |
The following is an example for configuring the Hive Metadata Provider. Consider a source with the following table:

```
TABLE GG.TCUSTMER {
    CUST_CODE VARCHAR(4) PRIMARY KEY,
    NAME VARCHAR(100),
    CITY VARCHAR(200),
    STATE VARCHAR(200)
}
```
The example maps the column CUST_CODE (GG.TCUSTMER) in the source to CUST_CODE2 (GG_HIVE.TCUSTMER_HIVE) on the target, and the column CITY (GG.TCUSTMER) in the source to CITY2 (GG_HIVE.TCUSTMER_HIVE) on the target.

Mapping configuration in the process_name.prm file:
MAP GG.TCUSTMER, TARGET GG_HIVE.TCUSTMER_HIVE, COLMAP(USEDEFAULTS, CUST_CODE2=CUST_CODE, CITY2=CITY) KEYCOLS(CUST_CODE2);
The mapping definition for this example is:

Source schema GG is mapped to target schema GG_HIVE.

Source column CUST_CODE is mapped to target column CUST_CODE2.

Source column CITY is mapped to target column CITY2.

USEDEFAULTS specifies that the rest of the column names are the same on both source and target (the NAME and STATE columns).

KEYCOLS is used to specify that CUST_CODE2 should be treated as the primary key.

Because primary keys cannot be specified in Hive DDL, the KEYCOLS parameter is used to specify the primary keys.
Create schema and tables in Hive for the preceding example:
Note:

You can choose any schema name; you are not restricted to the gg_hive schema name. The Hive schema can be pre-existing or newly created. If you change the schema name, update the connection URL (gg.mdp.connectionUrl) in the Java Adapter properties file and the mapping configuration in the Replicat .prm file accordingly.
To start the Hive CLI, type the following command:

HIVE_HOME/bin/hive

To create a schema, GG_HIVE, in Hive, use the following command:
```
hive> create schema gg_hive;
OK
Time taken: 0.02 seconds
```
To create a table, TCUSTMER_HIVE, in the GG_HIVE database, type the following command:

```
hive> CREATE EXTERNAL TABLE `TCUSTMER_HIVE`(
    >   `CUST_CODE2` VARCHAR(4),
    >   `NAME` VARCHAR(30),
    >   `CITY2` VARCHAR(20),
    >   `STATE` STRING);
OK
Time taken: 0.056 seconds
```
The configuration in the .properties file is similar to the following:
gg.mdp.type=hive
gg.mdp.connectionUrl=jdbc:hive2://HIVE_SERVER_IP:10000/gg_hive
gg.mdp.driverClassName=org.apache.hive.jdbc.HiveDriver
Following is sample output using the delimited text formatter, with a semi-colon as the delimiter, for the preceding example:
I;GG_HIVE.TCUSTMER_HIVE;2015-10-07T04:50:47.519000;cust_code2;WILL;name;BG SOFTWARE CO;city2;SEATTLE;state;WA
A sample Replicat configuration file, Java Adapter properties file, and a Hive create table SQL script are included with the installation, and located at:
GoldenGate_install_directory/AdapterExamples/big-data/metadata_provider/hive
The Hive server can be secured using Kerberos Authentication. Refer to the Hive documentation for your specific Hive release for instructions on how to secure the Hive server. The Hive Metadata Provider can connect to a Kerberos secured Hive server.
The HDFS core-site.xml and hive-site.xml files should be in the handler's classpath.
Enable the following properties in core-site.xml
:
```
<property>
   <name>hadoop.security.authentication</name>
   <value>kerberos</value>
</property>
<property>
   <name>hadoop.security.authorization</name>
   <value>true</value>
</property>
```
Enable the following properties in hive-site.xml:
```
<property>
   <name>hive.metastore.sasl.enabled</name>
   <value>true</value>
</property>
<property>
   <name>hive.metastore.kerberos.keytab.file</name>
   <value>/path/to/keytab</value> <!-- Change this value -->
</property>
<property>
   <name>hive.metastore.kerberos.principal</name>
   <value>Kerberos Principal</value> <!-- Change this value -->
</property>
<property>
   <name>hive.server2.authentication</name>
   <value>KERBEROS</value>
</property>
<property>
   <name>hive.server2.authentication.kerberos.principal</name>
   <value>Kerberos Principal</value> <!-- Change this value -->
</property>
<property>
   <name>hive.server2.authentication.kerberos.keytab</name>
   <value>/path/to/keytab</value> <!-- Change this value -->
</property>
```
Tables in the Hive metastore should be updated, altered, or created manually if there is a change in the source database tables. You may wish to abort or suspend the Replicat process in the case of a metadata change event. The Replicat process can be aborted by adding the following to the Replicat configuration file (process_name.prm):
DDL INCLUDE ALL, EVENTACTIONS (ABORT)
Alternatively, the Replicat process can be suspended by adding the following to the Replicat configuration file (process_name.prm):
DDL INCLUDE ALL, EVENTACTIONS (SUSPEND)
Columns with the binary data type cannot be used as a primary key.
The source to target mapping defined in the Replicat configuration file is static. Oracle GoldenGate 12.2 and later supports DDL propagation and source schema evolution for Oracle Databases as the replication source. However, evolution of the source schemas may be problematic with the static mapping configured in the Replicat configuration file.
The most common problems encountered are Java classpath issues. The Hive Metadata Provider requires certain Hive and HDFS client libraries to be resolved in its classpath as a prerequisite.

The required client JAR directories are listed in Classpath Configuration. The Hive and HDFS client JARs do not ship with the Oracle GoldenGate for Big Data product. The client JARs should be the same version as the Hive version to which the Hive Metadata Provider is connecting.
To establish a connection to the Hive server, the hive-site.xml
file must be in the classpath.
The Replicat process will abend with a "Table metadata resolution exception" if the mapped target table does not exist in Hive.
For example, consider the following mapping:
MAP GG.TCUSTMER, TARGET GG_HIVE.TCUSTMER_HIVE, COLMAP(USEDEFAULTS, CUST_CODE2=CUST_CODE, CITY2=CITY) KEYCOLS(CUST_CODE2);
The preceding mapping requires a table called TCUSTMER_HIVE to be created in the schema GG_HIVE in the Hive metastore. Failure to create the GG_HIVE.TCUSTMER_HIVE table in Hive results in the following exception:
ERROR [main) - Table Metadata Resolution Exception Unable to retrieve table matadata. Table : GG_HIVE.TCUSTMER_HIVE NoSuchObjectException(message:GG_HIVE.TCUSTMER_HIVE table not found)