B Pig Knowledge Modules

This appendix provides information about the Pig knowledge modules.

This appendix includes the following sections:

B.1 LKM File to Pig

This KM loads data from a file into Pig.

The supported data formats are:

  • Delimited

  • JSON

  • Pig Binary

  • Text

  • Avro

  • Trevni

  • Custom

Data can be loaded from the local file system or from HDFS.

The following table describes the options for LKM File to Pig.

Table B-1 LKM File to Pig

Option Description

Storage Function

The storage function to be used to load data.

Select the storage function to be used to load data.

Schema for Complex Fields

The Pig schema for simple/complex fields separated by comma (,).

Redefine the data types of the fields in Pig schema format. This option primarily allows you to override the default data type conversion for data store attributes, for example: PO_NO:int,PO_TOTAL:long,MOVIE_RATING:{(RATING:double,INFO:chararray)}. The names of the fields defined here must match the attribute names of the data store.

Function Class

Fully qualified name of the class to be used as the storage function to load data.

Specify the fully qualified name of the class to be used as the storage function to load data.

Function Parameters

The parameters required for the custom function.

Specify the parameters that the loader function expects.

For example, the XMLLoader function may look like XMLLoader('MusicStore', 'movie', 'id:double, name:chararray, director:chararray', options)

Here the first three arguments are parameters, which can be specified as -rootElement MusicStore -tableName movie -schema

where:

MusicStore - the root element of the XML.

movie - the element that wraps the child elements such as id, name, and so on.

The third argument is the representation of the data in Pig schema format.

The names of the parameters are arbitrary and there can be any number of parameters.

Options

Additional options required for the storage function.

Specify additional options required for the storage function.

For example, the XMLLoader function may look like XMLLoader('MusicStore', 'movie', 'id:double, name:chararray, director:chararray', options)

The last argument options can be specified as -namespace com.imdb -encoding utf8

Jars

The jar containing the storage function class and dependent libraries separated by colon (:).

Specify the jar containing the storage function class and dependent libraries separated by colon (:).

Storage Convertor

The converter that provides functions to cast from bytearray to each of Pig's internal types.

Specify the converter that provides functions to cast from bytearray to each of Pig's internal types.

The supported converter is Utf8StorageConverter.
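
For illustration only, a hand-written Pig Latin equivalent of loading a delimited file with the built-in PigStorage function might look like the following, where the AS clause plays the role of the Schema for Complex Fields option. The path, alias, and column names are hypothetical:

  -- Hypothetical HDFS path and columns; PigStorage(',') reads comma-delimited text.
  movies = LOAD '/user/odi/src/movies.csv' USING PigStorage(',')
           AS (PO_NO:int, PO_TOTAL:long, MOVIE_NAME:chararray);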


B.2 LKM Pig to File

This KM unloads data from Pig into a file.

The supported data formats are:

  • Delimited

  • JSON

  • Pig Binary

  • Text

  • Avro

  • Trevni

  • Custom

Data can be stored in the local file system or in HDFS.

The following table describes the options for LKM Pig to File.

Table B-2 LKM Pig to File

Option Description

Storage Function

The storage function to be used to store data.

Select the storage function to be used to store data.

Store Schema

If selected, stores the schema of the relation using a hidden JSON file.

Record Name

The Avro record name to be assigned to the bag of tuples being stored.

Specify a name to be assigned to the bag of tuples being stored.

Namespace

The namespace to be assigned to Avro/Trevni records, while storing data.

Specify a namespace for the bag of tuples being stored.

Delete Target File

Delete target file before Pig writes to the file.

If selected, the target file is deleted before storing data. This option effectively enables the target file to be overwritten.

Function Class

Fully qualified name of the class to be used as the storage function to store data.

Specify the fully qualified name of the class to be used as the storage function to store data.

Function Parameters

The parameters required for the custom function.

Specify the parameters that the storage function expects.

For example, the XMLLoader function may look like XMLLoader('MusicStore', 'movie', 'id:double, name:chararray, director:chararray', options)

Here the first three arguments are parameters, which can be specified as -rootElement MusicStore -tableName movie -schema

where:

MusicStore - the root element of the XML.

movie - the element that wraps the child elements such as id, name, and so on.

The third argument is the representation of the data in Pig schema format.

The names of the parameters are arbitrary and there can be any number of parameters.

Options

Additional options required for the storage function.

Specify additional options required for the storage function.

For example, the XMLLoader function may look like XMLLoader('MusicStore', 'movie', 'id:double, name:chararray, director:chararray', options)

The last argument options can be specified as -namespace com.imdb -encoding utf8

Jars

The jar containing the storage function class and dependent libraries separated by colon (:).

Specify the jar containing the storage function class and dependent libraries separated by colon (:).

Storage Convertor

The converter that provides functions to cast from bytearray to each of Pig's internal types.

Specify the converter that provides functions to cast from bytearray to each of Pig's internal types.

The supported converter is Utf8StorageConverter.
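
For illustration only, the hand-written Pig Latin equivalent of storing a relation as delimited text is a single STORE statement. The alias and output path are hypothetical:

  -- Hypothetical alias and output directory; PigStorage(',') writes comma-delimited text.
  STORE movies INTO '/user/odi/out/movies' USING PigStorage(',');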


B.3 LKM HBase to Pig

This KM loads data from an HBase table into Pig using the HBaseStorage function.

The following table describes the options for LKM HBase to Pig.

Table B-3 LKM HBase to Pig

Option Description

Storage Function

The storage function to be used to load data.

HBaseStorage is used to load data from an HBase table into Pig.

Load Row Key

Load the row key as the first value in every tuple returned from HBase.

If selected, loads the row key as the first value in every tuple returned from HBase. The row key is mapped to the 'key' column of the HBase data store in ODI.

Greater Than Min Key

Loads rows with key greater than the key specified for this option.

Specify the key value to load rows with key greater than the specified key value.

Less Than Min Key

Loads rows with row key less than the value specified for this option.

Specify the key value to load rows with key less than the specified key value.

Greater Than Or Equal Min Key

Loads rows with key greater than or equal to the key specified for this option.

Specify the key value to load rows with key greater than or equal to the specified key value.

Less Than Or Equal Min Key

Loads rows with row key less than or equal to the value specified for this option.

Specify the key value to load rows with key less than or equal to the specified key value.

Limit Rows

Maximum number of rows to retrieve per region.

Specify the maximum number of rows to retrieve per region.

Cached Rows

Number of rows to cache.

Specify the number of rows to cache.

Storage Convertor

The name of Caster to use to convert values.

Specify the class name of Caster to use to convert values. The supported values are HBaseBinaryConverter and Utf8StorageConverter. If unspecified, the default value is Utf8StorageConverter.

Column Delimiter

The delimiter to be used to separate columns in the columns list of HBaseStorage function.

Specify the delimiter to be used to separate columns in the columns list of HBaseStorage function. If unspecified, the default is whitespace.

Timestamp

Return cell values that have a creation timestamp equal to this value.

Specify a timestamp to return cell values that have a creation timestamp equal to the specified value.

Min Timestamp

Return cell values that have a creation timestamp greater than or equal to this value.

Specify a timestamp to return cell values that have a creation timestamp greater than or equal to the specified value.

Max Timestamp

Return cell values that have a creation timestamp less than this value.

Specify a timestamp to return cell values that have a creation timestamp less than the specified value.
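
For illustration only, a hand-written Pig Latin load through HBaseStorage that exercises several of the options above might look like the following. The table, column family, qualifier, and row key values are hypothetical, and the option-to-flag mapping shown in the comments is indicative:

  -- Hypothetical HBase table 'movie' with column family 'info'.
  -- -loadKey ~ Load Row Key, -gte ~ Greater Than Or Equal Min Key, -limit ~ Limit Rows,
  -- -caching ~ Cached Rows, -caster ~ Storage Convertor.
  ratings = LOAD 'hbase://movie'
            USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
                'info:name info:rating',
                '-loadKey true -gte row_1000 -limit 500 -caching 100 -caster Utf8StorageConverter')
            AS (key:chararray, name:chararray, rating:chararray);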


B.4 LKM Pig to HBase

This KM stores data into an HBase table using the HBaseStorage function.

The following table describes the options for LKM Pig to HBase.

Table B-4 LKM Pig to HBase

Option Description

Storage Function

The storage function to be used to store data. This is a read-only option, which cannot be changed.

HBaseStorage is used to store data into an HBase table.

Storage Convertor

The name of Caster to use to convert values.

Specify the class name of Caster to use to convert values. The supported values are HBaseBinaryConverter and Utf8StorageConverter. If unspecified, the default value is Utf8StorageConverter.

Column Delimiter

The delimiter to be used to separate columns in the columns list of HBaseStorage function.

Specify the delimiter to be used to separate columns in the columns list of HBaseStorage function. If unspecified, the default is whitespace.

Disable Write Ahead Log

If selected, the write-ahead log is disabled for faster loading into HBase.

If selected, the write-ahead log is set to false for faster loading into HBase. This option must be used with extreme caution, since it could result in data loss. The default value is false.
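
For illustration only, a hand-written Pig Latin store through HBaseStorage might look like the following. The table, columns, and alias are hypothetical, and the flag mapping in the comments is indicative:

  -- The first field of the relation becomes the HBase row key; the remaining fields
  -- map to the listed columns. -caster ~ Storage Convertor, -noWAL ~ Disable Write
  -- Ahead Log (use with extreme caution).
  STORE ratings INTO 'hbase://movie'
      USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
          'info:name info:rating',
          '-caster HBaseBinaryConverter -noWAL true');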


B.5 LKM Hive to Pig

This KM loads data from a Hive table into Pig using HCatalog.

The following table describes the options for LKM Hive to Pig.

Table B-5 LKM Hive to Pig

Option Description

Storage Function

The storage function to be used to load data. This is a read-only option, which cannot be changed.

HCatLoader is used to load data from a Hive table.
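
For illustration only, the equivalent hand-written Pig Latin load is a single statement. The Hive database and table names are hypothetical, and the HCatLoader package name depends on the HCatalog version in use:

  -- Loads the Hive table default.movie through HCatalog; the schema comes from the Hive metastore.
  movie = LOAD 'default.movie' USING org.apache.hive.hcatalog.pig.HCatLoader();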


B.6 LKM Pig to Hive

This KM stores data into a Hive table using HCatalog.

The following table describes the options for LKM Pig to Hive.

Table B-6 LKM Pig to Hive

Option Description

Storage Function

The storage function to be used to store data. This is a read-only option, which cannot be changed.

HCatStorer is used to store data into a Hive table.

Partition

The new partition to be created.

Represents the key/value pairs for the partition. This is a mandatory argument when you are writing to a partitioned table and the partition column is not in the output columns. The values for the partition keys must not be quoted.
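
For illustration only, a hand-written Pig Latin store through HCatStorer, including an unquoted partition specification as described for the Partition option, might look like the following. The database, table, and partition names are hypothetical, and the HCatStorer package name depends on the HCatalog version in use:

  -- Writes into the partition month=12, day=25 of the Hive table default.movie_archive.
  -- Note that the partition key values are not quoted inside the specification string.
  STORE movie INTO 'default.movie_archive'
      USING org.apache.hive.hcatalog.pig.HCatStorer('month=12,day=25');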


B.7 LKM SQL to Pig SQOOP

This KM integrates data from a JDBC data source into Pig.

It executes the following steps:

  1. Creates a Sqoop configuration file, which contains the upstream query.

  2. Executes Sqoop to extract the source data and import it into a staging file in CSV format.

  3. Runs the LKM File to Pig KM to load the staging file into Pig (a sketch of this load appears after this list).

  4. Drops the staging file.
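
For illustration only, step 3 conceptually amounts to a delimited load of the Sqoop staging file, using the delimiter configured by STAGING_FILE_DELIMITER (tab by default). The staging path and column names below are hypothetical:

  -- Hypothetical staging directory produced by Sqoop in step 2.
  staging = LOAD '/tmp/odi_sqoop_staging' USING PigStorage('\t')
            AS (CUST_ID:int, CUST_NAME:chararray);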

The following table describes the options for LKM SQL to Pig SQOOP.

Table B-7 LKM SQL to Pig SQOOP

Option Description

STAGING_FILE_DELIMITER

Sqoop uses this delimiter to create the temporary file. If not specified, \t will be used.

Storage Function

The storage function to be used to load data.

Select the storage function to be used to load data.

Schema for Complex Fields

The Pig schema for simple/complex fields separated by comma (,).

Redefine the data types of the fields in Pig schema format. This option primarily allows you to override the default data type conversion for data store attributes, for example: PO_NO:int,PO_TOTAL:long,MOVIE_RATING:{(RATING:double,INFO:chararray)}. The names of the fields defined here must match the attribute names of the data store.

Function Class

Fully qualified name of the class to be used as the storage function to load data.

Specify the fully qualified name of the class to be used as the storage function to load data.

Function Parameters

The parameters required for the custom function.

Specify the parameters that the loader function expects.

For example, the XMLLoader function may look like XMLLoader('MusicStore', 'movie', 'id:double, name:chararray, director:chararray', options)

Here the first three arguments are parameters, which can be specified as -rootElement MusicStore -tableName movie -schema

where:

MusicStore - the root element of the XML.

movie - the element that wraps the child elements such as id, name, and so on.

The third argument is the representation of the data in Pig schema format.

The names of the parameters are arbitrary and there can be any number of parameters.

Options

Additional options required for the storage function.

Specify additional options required for the storage function.

For example, the XMLLoader function may look like XMLLoader('MusicStore', 'movie', 'id:double, name:chararray, director:chararray', options)

The last argument options can be specified as -namespace com.imdb -encoding utf8

Jars

The jar containing the storage function class and dependent libraries separated by colon (:).

Specify the jar containing the storage function class and dependent libraries separated by colon (:).

Storage Convertor

The converter that provides functions to cast from bytearray to each of Pig's internal types.

Specify the converter that provides functions to cast from bytearray to each of Pig's internal types.

The supported converter is Utf8StorageConverter.


B.8 XKM Pig Aggregate

Summarize rows, for example using SUM and GROUP BY.

The following table describes the options for XKM Pig Aggregate.

Table B-8 XKM Pig Aggregate

Option Description

USING_ALGORITHM

Aggregation type: collected or merge.

PARTITION_BY

Specify the Hadoop partitioner.

PARTITIONER_JAR

The jar containing the custom partitioner class specified in the PARTITION_BY option.

PARALLEL_NUMBER

Increase the parallelism of this job.
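
For illustration only, an aggregation in Pig Latin is a GROUP followed by a FOREACH with aggregate functions; the USING_ALGORITHM and PARALLEL_NUMBER options roughly correspond to the USING and PARALLEL clauses of GROUP. The aliases and columns are hypothetical:

  -- USING 'collected' or 'merge' (placed before PARALLEL) would correspond to USING_ALGORITHM;
  -- PARALLEL sets the number of reduce tasks.
  grp    = GROUP orders BY PO_NO PARALLEL 4;
  totals = FOREACH grp GENERATE group AS PO_NO, SUM(orders.PO_TOTAL) AS PO_TOTAL_SUM;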


B.9 XKM Pig Distinct

Eliminates duplicates in data.
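
In Pig Latin, duplicate elimination is expressed with the DISTINCT operator; a minimal sketch with a hypothetical alias:

  -- Removes duplicate tuples from the relation.
  deduped = DISTINCT movies;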

B.10 XKM Pig Expression

Define expressions to be reused across a single mapping.

B.11 XKM Pig Filter

Produce a subset of data by a filter condition.
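
In Pig Latin, a filter condition is expressed with the FILTER operator; a minimal sketch with a hypothetical alias and column:

  -- Keeps only the tuples whose total is at least 1000.
  large_orders = FILTER orders BY PO_TOTAL >= 1000;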

B.12 XKM Pig Flatten

Un-nest the complex data according to the given options.

The following table describes the options for XKM Pig Flatten.

Table B-9 XKM Pig Flatten

Option Description

Default Expression

Default expression for null nested table objects, e.g. rating_table(obj_rating('-1', 'Unknown')).

This is used to return a row with default values for each null nested table object.
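
In Pig Latin, un-nesting is expressed with FLATTEN inside a FOREACH; a minimal sketch reusing the hypothetical MOVIE_RATING bag from the Schema for Complex Fields example:

  -- Produces one output tuple per element of the MOVIE_RATING bag.
  flat = FOREACH movies GENERATE PO_NO, FLATTEN(MOVIE_RATING);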


B.13 XKM Pig Join

Joins two or more input sources based on the join condition.

The following table describes the options for XKM Pig Join.

Table B-10 XKM Pig Join

Option Description

USING_ALGORITHM

Join type: replicated, skewed, or merge.

PARTITION_BY

Specify the Hadoop partitioner.

PARTITIONER_JAR

The jar containing the custom partitioner class specified in the PARTITION_BY option.

PARALLEL_NUMBER

Increase the parallelism of this job.
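
For illustration only, a Pig Latin join with the USING and PARALLEL clauses that correspond to USING_ALGORITHM and PARALLEL_NUMBER might look like the following. The aliases and keys are hypothetical:

  -- Replicated join: the second relation must be small enough to fit in memory.
  joined = JOIN orders BY CUST_ID, customers BY CUST_ID USING 'replicated' PARALLEL 4;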


B.14 XKM Pig Lookup

Lookup data for a driving data source.

The following table describes the options for XKM Pig Lookup.

Table B-11 XKM Pig Lookup

Option Description

Jars

The jar containing the User Defined Function classes and dependent libraries separated by colon (:).
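
Assuming the lookup is realized as an outer join in the generated Pig Latin (an assumption; the aliases and keys below are hypothetical), a minimal sketch looks like this:

  -- Keeps every driving row and attaches the matching lookup row where one exists.
  with_lookup = JOIN orders BY CUST_ID LEFT OUTER, customers BY CUST_ID;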


B.15 XKM Pig Pivot

Takes data in separate rows, aggregates it, and converts it into columns.

B.16 XKM Pig Set

Perform UNION, MINUS or other set operations.
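
In Pig Latin, a union is a single statement; a minimal sketch with hypothetical aliases (ONSCHEMA aligns the relations by field name):

  -- Combines the tuples of both relations into one.
  combined = UNION ONSCHEMA sales_2014, sales_2015;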

B.17 XKM Pig Sort

Sort data using an expression.
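
In Pig Latin, sorting is expressed with the ORDER BY operator; a minimal sketch with a hypothetical alias and columns:

  -- Sorts by name ascending, then by total descending.
  sorted = ORDER movies BY MOVIE_NAME ASC, PO_TOTAL DESC;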

B.18 XKM Pig Split

Split data into multiple paths with multiple conditions.
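
In Pig Latin, this corresponds to the SPLIT operator; a minimal sketch with a hypothetical alias, column, and conditions:

  -- Routes each tuple into one of two relations based on its total.
  SPLIT orders INTO large IF PO_TOTAL >= 1000, other OTHERWISE;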

B.19 XKM Pig Subquery Filter

Filter rows based on the results of a subquery.

B.20 XKM Pig Table Function

Pig table function access.

The following table describes the options for XKM Pig Table Function.

Table B-12 XKM Pig Table Function

Option Description

PIG_SCRIPT_CONTENT

User-specified Pig script content.


B.21 XKM Pig Unpivot

Transform a single row of attributes into multiple rows in an efficient manner.