7.2 Hive Metadata Provider

The Hive Metadata Provider is used to retrieve the table metadata from a Hive metastore. The metadata will be retrieved from Hive for every target table mapped in the Replicat properties file using the COLMAP syntax. The retrieved target metadata will be used by Replicat for the column mapping functionality.

This sections contains the following:

7.2.1 Detailed Functionality

The Hive Metadata Provider uses both Hive JDBC and HCatalog interfaces to retrieve metadata from the Hive metastore. For each table mapped in Replicat.prm file, a corresponding table should be created in Hive.

The default Hive configuration starts an embedded/local metastore Derby database. Apache Derby is designed to be an embedded database and only allows a single connection. The single connection limitation of the Derby Database as the Hive Metastore implementation means that it will not function when working with the Oracle GoldenGate for Big Data Hive Metadata Provider. To overcome this limitation Hive should be configured with a remote metastore database. More information on configuring Hive with remote metastore database can found at:

https://cwiki.apache.org/confluence/display/Hive/AdminManual+MetastoreAdmin#AdminManualMetastoreAdmin-RemoteMetastoreDatabase

Hive does not support Primary Key semantics, so the metadata retrieved from Hive metastore will not include any Primary Key definition. Replicat's KEYCOLS configuration syntax should instead be used to define primary keys when the Hive Metadata Provider is used.

KEYCOLS

The Replicat mapping configuration syntax KEYCOLS must be used to define primary keys in the target schema. The Oracle GoldenGate HBase Handler requires primary keys. Therefore, setting primary keys in the target schema is required when Replicat mapping is employed with HBase as the target. Additionally, the output of the Avro Formatters includes an Array field to hold the primary column names. Therefore, if Replicat mapping is employed with the Avro Formatters users should consider using KEYCOLS to identify the primary key columns.

Examples of configuring KEYCOLS will be discussed in Sample Configuration.

Supported Hive Data types:

  • BIGINT

  • BINARY

  • BOOLEAN

  • CHAR

  • DATE

  • DECIMAL

  • DOUBLE

  • FLOAT

  • INT

  • SMALLINT

  • STRING

  • TIMESTAMP

  • TINYINT

  • VARCHAR

For more information on Hive data types, see https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types.

7.2.2 Runtime Prerequisites

  1. Configuring Hive with a Remote Metastore Database

    A list of supported databases that can be used to configure remote Hive metastore can be found at https://cwiki.apache.org/confluence/display/Hive/AdminManual+MetastoreAdmin#AdminManualMetastoreAdmin-SupportedBackendDatabasesforMetastore. In the following example, a MySQL database is configured as the Hive metastore. Configure the following properties in the ${HIVE_HOME}/conf/hive-site.xml Hive configuration file.

    Note:

    The ConnectionURL and driver class used in this example are specific to MySQL database. Change the values appropriately if any database other than MySQL is chosen.

    <property>
             <name>javax.jdo.option.ConnectionURL</name>        
             <value>jdbc:mysql://MYSQL_DB_IP:MYSQL_DB_PORT/DB_NAME?createDatabaseIfNotExist=false</value>
     </property>
     
     <property>
            <name>javax.jdo.option.ConnectionDriverName</name>
            <value>com.mysql.jdbc.Driver</value>
     </property>
     
     <property>
              <name>javax.jdo.option.ConnectionUserName</name>
         <value>MYSQL_CONNECTION_USERNAME</value>
     </property>
     
     <property>
             <name>javax.jdo.option.ConnectionPassword</name>
             <value>MYSQL_CONNECTION_PASSWORD</value>
     </property>
    

    The list of parameters to be configured in the hive-site.xml file for a remote metastore can be found at https://cwiki.apache.org/confluence/display/Hive/AdminManual+MetastoreAdmin#AdminManualMetastoreAdmin-RemoteMetastoreDatabase.

    Note:

    MySQL jdbc connector jar should be added in the Hive classpath that is,

    1. In HIVE_HOME/lib/ directory. DB_NAME should be replaced by a valid database name created in MySQL.

    2. Start the Hive Server:

      HIVE_HOME/bin/hiveserver2/bin/hiveserver2

    3. Start the Hive Remote Metastore Server:

      HIVE_HOME/bin/hive --service metastore

7.2.3 Classpath Configuration

You must configure two things in the gg.classpath configuration variable in order for the Oracle GoldenGate for Big Data Hive Metadata Provider to connect to Hive and run. The first is the hive-site.xml file and the second are the Hive and HDFS client jars. The client jars must match the version of Hive that the Oracle GoldenGate for Big Data Hive Metadata Provider is connecting.

  1. Create hive-site.xml file with the following properties:

    <configuration>
    <!-- Mandatory Property --> 
    <property>
    <name>hive.metastore.uris</name>
    <value>thrift://HIVE_SERVER_HOST_IP:9083</value>
    <property>
     
    <!-- Optional Property. Default value is 5 -->
    <property>
    <name>hive.metastore.connect.retries</name>
    <value>3</value>
    </property>
     
    <!-- Optional Property. Default value is 1 -->
    <property>
    <name>hive.metastore.client.connect.retry.delay</name>
    <value>10</value>
    </property>
     
    <!-- Optional Property. Default value is 600 seconds -->
    <property>
    <name>hive.metastore.client.socket.timeout</name>
    <value>50</value>
    </property>
    
     </configuration>
    

    Note:

    For example, if the hive-site.xml file is created in /home/user/oggadp/dirprm directory, then gg.classpath entry will look like gg.classpath=/home/user/oggadp/dirprm/

  2. The default location of the Hive and HDFS client jars are the following directories:

    HIVE_HOME/hcatalog/share/hcatalog/*
    HIVE_HOME/lib/*
    HIVE_HOME/hcatalog/share/webhcat/java-client/*
    HADOOP_HOME/share/hadoop/common/*
    HADOOP_HOME/share/hadoop/common/lib/*
    HADOOP_HOME/share/hadoop/mapreduce/*
    

    Configure the gg.classpath exactly as shown in the preceding example. Creating a path to the hive-site.xml should simply contain the path with no wild card appended. The inclusion of the * wildcard in the path to the hive-site.xml file will cause it not to be picked up. Conversely, creating a path to the dependency jars should include the * wild card character in order to include all of the jar files in that directory in the associated classpath. Do not use *.jar.

7.2.4 Hive Metadata Provider Configuration

The configuration properties of the Hive Metadata Provider are detailed in this section.


Property Required/Optional Legal Values Default Explanation

gg.mdp.type

Required

hive

-

Selects Hive Metadata Provider

gg.mdp.connectionUrl

Required

Format without Kerberos Authentication:

jdbc:hive2://HIVE_SERVER_IP:HIVE_JDBC_PORT/HIVE_DB

Format with Kerberos Authentication:

jdbc:hive2://HIVE_SERVER_IP:HIVE_JDBC_PORT/HIVE_DB; principal=user/FQDN@MY.REALM

-

JDBC Connection URL of Hive Server

gg.mdp.driverClassName

Required

org.apache.hive.jdbc.HiveDriver

-

Fully qualified Hive JDBC Driver class name.

gg.mdp.userName

Optional

Valid username

""

Username to connect to Hive Database. userName property is not required when Kerberos Authentication is used. The Kerberos principal should be specified in the connection url as specified in connectionUrl property's legal values.

gg.mdp.password

Optional

Valid Password

""

Password to connect to Hive Database

gg.mdp.charset

Optional

Valid character set

UTF-8

Specifies the character set of the column with character data type. This is used to convert the source data from trail file to the correct target charset.

gg.mdp.nationalCharset

Optional

Valid character set

UTF-8

Specifies the character set of the column with character data type. This is used to convert the source data from trail file to the correct target charset.

For example, Indicates character set of columns like NCHAR, NVARCHAR in case of Oracle Database.

gg.mdp.authType

Optional

kerberos

none

 

gg.mdp.kerberosKeytabFile

Optional (Required if authType=kerberos)

Relative or absolute path to a Kerberos keytab file.

-

The keytab file allows Hive to access a password to perform kinit operation for Kerberos security.

gg.mdp.kerberosPrincipal

Optional (Required if authType=kerberos)

A legal Kerberos principal name(user/FQDN@MY.REALM)

-

The Kerberos principal name for Kerberos authentication.


7.2.5 Sample Configuration

This section provides an example for configuring the Hive Metadata Provider.

Consider a source with following table:

TABLE GG.TCUSTMER {
     CUST_CODE VARCHAR(4)   PRIMARY KEY,
     NAME VARCHAR(100),
     CITY VARCHAR(200),
     STATE VARCHAR(200)}

The example maps the column CUST_CODE (GG.TCUSTMER) in the source to CUST_CODE2 (GG_HIVE.TCUSTMER_HIVE) on the target and column CITY (GG.TCUSTMER) in the source to CITY2 (GG_HIVE.TCUSTMER_HIVE)on the target.

Mapping configuration in the Replicat.prm file:

MAP GG.TCUSTMER, TARGET GG_HIVE.TCUSTMER_HIVE, COLMAP(USEDEFAULTS, CUST_CODE2=CUST_CODE, CITY2=CITY) KEYCOLS(CUST_CODE2); 

The mapping definition for this example is:

  • Source schema GG is mapped to target schema GG_HIVE

  • Source column CUST_CODE is mapped to target column CUST_CODE2

  • Source column CITY is mapped to target column CITY2

  • USEDEFAULTS specifies that rest of the columns names are same on both source and target (NAME and STATE columns).

  • KEYCOLS is used to specify that CUST_CODE2 should be treated as primary key.

Since primary keys cannot be specified in Hive DDL, KEYCOLS is used to specify the primary keys.

Create schema and tables in Hive for the preceding example:

Note:

You can choose any schema name and are not restricted to the gg_hive schema name. The Hive schema can be pre-existing or newly created. You do this by modifying the connection URL (gg.mdp.connectionUrl) in the Java Adapter properties file and the mapping configuration in the Replicat .prm file. Once the schema name is changed, the connection URL (gg.mdp.connectionUrl) and mapping in Replicat .prm file should be updated.

To start the Hive CLI type the following command:

HIVE_HOME/bin/hive

To create a schema, GG_HIVE, in Hive, use the following command:

hive> create schema gg_hive;
OK
Time taken: 0.02 seconds

To create a table TCUSTMER_HIVE in GG_HIVE database type the following command:

hive> CREATE EXTERNAL TABLE `TCUSTMER_HIVE`(
    >   "CUST_CODE2" VARCHAR(4),
    >   "NAME" VARCHAR(30),
    >   "CITY2" VARCHAR(20),
    >   "STATE" STRING);
OK
Time taken: 0.056 seconds

Configuration in .properties file can be like the following:

gg.mdp.type=hive
gg.mdp.connectionUrl=jdbc:hive2://<HIVE_SERVER_IP>:10000/gg_hive
gg.mdp.driverClassName=org.apache.hive.jdbc.HiveDriver

Following is the sample output using delimited text formatter with ";" as delimiter for the preceding example.

I;GG_HIVE.TCUSTMER_HIVE;2015-10-07T04:50:47.519000;cust_code2;WILL;name;BG SOFTWARE CO;city2;SEATTLE;state;WA

A sample Replicat configuration file, Java Adapter properties file, and a Hive create table SQL script are included with the installation, and located at:

GoldenGate_install_directory/AdapterExamples/big-data/metadata_provider/hive

7.2.6 Security

The Hive server can be secured using Kerberos Authentication. Refer to the Hive documentation for your specific Hive release for instructions on how to secure the Hive server. The Oracle GoldenGate for Big Data Hive Metadata Provider can connect to Kerberos secured Hive server.

The HDFS core-site.xml and hive-site.xml should be in handler's classpath.

Following properties should be enabled in core-site.xml:

<property>
<name>hadoop.security.authentication</name>
<value>kerberos</value> 
</property>
 
<property> 
<name>hadoop.security.authorization</name> 
<value>true</value> 
</property>

Following properties should be enabled in hive-site.xml

<property>
<name>hive.metastore.sasl.enabled</name>
<value>true</value>
</property>
 
<property>
<name>hive.metastore.kerberos.keytab.file</name>
<value>/path/to/keytab</value> <!-- Change this value -->
</property>
 
<property>
<name>hive.metastore.kerberos.principal</name>
<value>Kerberos Principal</value> <!-- Change this value -->
</property>
 
<property>
   <name>hive.server2.authentication</name>
    <value>KERBEROS</value>
</property>
 
<property>
   <name>hive.server2.authentication.kerberos.principal</name>
    <value>Kerberos Principal</value> <!-- Change this value -->
</property>
 
<property>
    <name>hive.server2.authentication.kerberos.keytab</name>
    <value>/path/to/keytab</value> <!-- Change this value -->
</property>

7.2.7 Metadata Change Event

Tables in Hive metastore should be updated/altered/created manually if there is a change in source database tables. You may wish to abort or suspend the Replicat process in the case of a metadata change event. The Replicat process can be aborted by adding the following to the Replicat configuration file (.prm file).

DDL INCLUDE ALL, EVENTACTIONS (ABORT)

Alternatively, the Replicat process can be suspended by adding the following to the Replication configuration file (.prm file).

DDL INCLUDE ALL, EVENTACTIONS (SUSPEND)

7.2.8 Limitations

Columns with binary data type cannot be used as primary key.

The source to target mapping defined in the Replicat configuration file is static. Oracle GoldenGate 12.2 supports DDL propagation and source schema evolution for Oracle Databases as the replication source. However, evolution of the source schemas may be problematic the static mapping configured in the Replicat configuration file.

7.2.9 Common Pitfalls

The most common problems encountered are the Java classpath issues. Oracle Hive Metadata Provider requires certain Hive and HDFS client libraries to be resolved in its classpath as a prerequisite.

The required client jar directories are listed in Classpath Configuration. Hive and HDFS client jars do not ship with Oracle GoldenGate for Big Data product. The client jars should be the same version as the Hive version to which Hive Metadata Provider is connecting.

In order to establish a connection to the Hive server, the hive-site.xml file must be in the classpath.

7.2.10 Troubleshooting

The Replicat process will abend with a "Table metadata resolution exception" if the mapped target table does not exist in Hive.

For example, consider the following mapping:

MAP GG.TCUSTMER, TARGET GG_HIVE.TCUSTMER_HIVE, COLMAP(USEDEFAULTS, CUST_CODE2=CUST_CODE, CITY2=CITY) KEYCOLS(CUST_CODE2);

The preceding mapping requires a table called TCUSTMER_HIVE to be created in schema GG_HIVE in the Hive metastore. Failure to create the GG_HIVE.TCUSTMER_HIVE table in Hive will result in following exception:

ERROR [main) - Table Metadata Resolution Exception
Unable to retrieve table matadata. Table : GG_HIVE.TCUSTMER_HIVE
NoSuchObjectException(message:GG_HIVE.TCUSTMER_HIVE table not found)