B Pig Knowledge Modules

This appendix provides information about the Pig knowledge modules.

This appendix includes the following sections:

B.1 LKM File to Pig

This KM loads data from a file into Pig.

The supported data formats are:

  • Delimited

  • JSON

  • Pig Binary

  • Text

  • Avro

  • Trevni

  • Custom

Data can be loaded from the local file system or from HDFS.

The following table describes the options for LKM File to Pig.

Table B-1 LKM File to Pig

Option Description

Storage Function

The storage function to be used to load data.

Select the storage function to be used to load data.

Schema for Complex Fields

The Pig schema for simple/complex fields separated by comma (,).

Redefine the data types of the fields in Pig schema format. This option primarily allows you to override the default data type conversion for data store attributes, for example: PO_NO:int,PO_TOTAL:long,MOVIE_RATING:{(RATING:double,INFO:chararray)}. The names of the fields defined here must match the attribute names of the data store.

Function Class

Fully qualified name of the class to be used as the storage function to load data.

Specify the fully qualified name of the class to be used as the storage function to load data.

Function Parameters

The parameters required for the custom function.

Specify the parameters that the loader function expects.

For example, the XMLLoader function may look like XMLLoader('MusicStore', 'movie', 'id:double, name:chararray, director:chararray', options)

Here the first three arguments are parameters, which can be specified as -rootElement MusicStore -tableName movie -schema

where:

MusicStore - the root element of the XML.

movie - the element that wraps the child elements such as id, name, and so on.

The third argument is the representation of the data in Pig schema format.

The names of the parameters are arbitrary and there can be any number of parameters.

Options

Additional options required for the storage function.

Specify additional options required for the storage function.

For example, the XMLLoader function may look like XMLLoader('MusicStore', 'movie', 'id:double, name:chararray, director:chararray', options)

The last argument options can be specified as -namespace com.imdb -encoding utf8

Jars

The jar containing the storage function class and dependent libraries separated by colon (:).

Specify the jar containing the storage function class and dependent libraries separated by colon (:).

Storage Convertor

The converter that provides functions to cast from bytearray to each of Pig's internal types.

Specify the converter that provides functions to cast from bytearray to each of Pig's internal types.

The supported converter is Utf8StorageConverter.
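
For illustration only, a hand-written Pig Latin equivalent of loading a delimited file with the built-in PigStorage function might look like the following, where the AS clause plays the role of the Schema for Complex Fields option. The path, alias, and column names are hypothetical:

  -- Hypothetical HDFS path and columns; PigStorage(',') reads comma-delimited text.
  movies = LOAD '/user/odi/src/movies.csv' USING PigStorage(',')
           AS (PO_NO:int, PO_TOTAL:long, MOVIE_NAME:chararray);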


B.2 LKM Pig to File

This KM unloads data from Pig into a file.

The supported data formats are:

  • Delimited

  • JSON

  • Pig Binary

  • Text

  • Avro

  • Trevni

  • Custom

Data can be stored in the local file system or in HDFS.

The following table describes the options for LKM Pig to File.

Table B-2 LKM Pig to File

Option Description

Storage Function

The storage function to be used to store data.

Select the storage function to be used to store data.

Store Schema

If selected, stores the schema of the relation using a hidden JSON file.

Record Name

The Avro record name to be assigned to the bag of tuples being stored.

Specify a name to be assigned to the bag of tuples being stored.

Namespace

The namespace to be assigned to Avro/Trevni records, while storing data.

Specify a namespace for the bag of tuples being stored.

Delete Target File

Delete target file before Pig writes to the file.

If selected, the target file is deleted before storing data. This option effectively enables the target file to be overwritten.

Function Class

Fully qualified name of the class to be used as the storage function to store data.

Specify the fully qualified name of the class to be used as the storage function to store data.

Function Parameters

The parameters required for the custom function.

Specify the parameters that the storage function expects.

For example, the XMLLoader function may look like XMLLoader('MusicStore', 'movie', 'id:double, name:chararray, director:chararray', options)

Here the first three arguments are parameters, which can be specified as -rootElement MusicStore -tableName movie -schema

where:

MusicStore - the root element of the XML.

movie - the element that wraps the child elements such as id, name, and so on.

The third argument is the representation of the data in Pig schema format.

The names of the parameters are arbitrary and there can be any number of parameters.

Options

Additional options required for the storage function.

Specify additional options required for the storage function.

For example, the XMLLoader function may look like XMLLoader('MusicStore', 'movie', 'id:double, name:chararray, director:chararray', options)

The last argument options can be specified as -namespace com.imdb -encoding utf8

Jars

The jar containing the storage function class and dependent libraries separated by colon (:).

Specify the jar containing the storage function class and dependent libraries separated by colon (:).

Storage Convertor

The converter that provides functions to cast from bytearray to each of Pig's internal types.

Specify the converter that provides functions to cast from bytearray to each of Pig's internal types.

The supported converter is Utf8StorageConverter.
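
For illustration only, the hand-written Pig Latin equivalent of storing a relation as delimited text is a single STORE statement. The alias and output path are hypothetical:

  -- Hypothetical alias and output directory; PigStorage(',') writes comma-delimited text.
  STORE movies INTO '/user/odi/out/movies' USING PigStorage(',');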


B.3 LKM HBase to Pig

This KM loads data from an HBase table into Pig using the HBaseStorage function.

The following table describes the options for LKM HBase to Pig.

Table B-3 LKM HBase to Pig

Option Description

Storage Function

The storage function to be used to load data.

HBaseStorage is used to load data from an HBase table into Pig.

Load Row Key

Load the row key as the first value in every tuple returned from HBase.

If selected, loads the row key as the first value in every tuple returned from HBase. The row key is mapped to the 'key' column of the HBase data store in ODI.

Greater Than Min Key

Loads rows with key greater than the key specified for this option.

Specify the key value to load rows with key greater than the specified key value.

Less Than Min Key

Loads rows with row key less than the value specified for this option.

Specify the key value to load rows with key less than the specified key value.

Greater Than Or Equal Min Key

Loads rows with key greater than or equal to the key specified for this option.

Specify the key value to load rows with key greater than or equal to the specified key value.

Less Than Or Equal Min Key

Loads rows with row key less than or equal to the value specified for this option.

Specify the key value to load rows with key less than or equal to the specified key value.

Limit Rows

Maximum number of rows to retrieve per region.

Specify the maximum number of rows to retrieve per region.

Cached Rows

Number of rows to cache.

Specify the number of rows to cache.

Storage Convertor

The name of Caster to use to convert values.

Specify the class name of Caster to use to convert values. The supported values are HBaseBinaryConverter and Utf8StorageConverter. If unspecified, the default value is Utf8StorageConverter.

Column Delimiter

The delimiter to be used to separate columns in the columns list of HBaseStorage function.

Specify the delimiter to be used to separate columns in the columns list of HBaseStorage function. If unspecified, the default is whitespace.

Timestamp

Return cell values that have a creation timestamp equal to this value.

Specify a timestamp to return cell values that have a creation timestamp equal to the specified value.

Min Timestamp

Return cell values that have a creation timestamp greater than or equal to this value.

Specify a timestamp to return cell values that have a creation timestamp greater than or equal to the specified value.

Max Timestamp

Return cell values that have a creation timestamp less than this value.

Specify a timestamp to return cell values that have a creation timestamp less than the specified value.
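
For illustration only, a hand-written Pig Latin load through HBaseStorage that exercises several of the options above might look like the following. The table, column family, qualifier, and row key values are hypothetical, and the option-to-flag mapping shown in the comments is indicative:

  -- Hypothetical HBase table 'movie' with column family 'info'.
  -- -loadKey ~ Load Row Key, -gte ~ Greater Than Or Equal Min Key, -limit ~ Limit Rows,
  -- -caching ~ Cached Rows, -caster ~ Storage Convertor.
  ratings = LOAD 'hbase://movie'
            USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
                'info:name info:rating',
                '-loadKey true -gte row_1000 -limit 500 -caching 100 -caster Utf8StorageConverter')
            AS (key:chararray, name:chararray, rating:chararray);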


B.4 LKM Pig to HBase

This KM stores data into an HBase table using the HBaseStorage function.

The following table describes the options for LKM Pig to HBase.

Table B-4 LKM Pig to HBase

Option Description

Storage Function

The storage function to be used to store data. This is a read-only option, which cannot be changed.

HBaseStorage is used to store data into an HBase table.

Storage Convertor

The name of Caster to use to convert values.

Specify the class name of Caster to use to convert values. The supported values are HBaseBinaryConverter and Utf8StorageConverter. If unspecified, the default value is Utf8StorageConverter.

Column Delimiter

The delimiter to be used to separate columns in the columns list of HBaseStorage function.

Specify the delimiter to be used to separate columns in the columns list of HBaseStorage function. If unspecified, the default is whitespace.

Disable Write Ahead Log

If selected, the write-ahead log is disabled for faster loading into HBase.

If selected, the write-ahead log is set to false for faster loading into HBase. This option must be used with extreme caution, since it could result in data loss. The default value is false.
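
For illustration only, a hand-written Pig Latin store through HBaseStorage might look like the following. The table, columns, and alias are hypothetical, and the flag mapping in the comments is indicative:

  -- The first field of the relation becomes the HBase row key; the remaining fields
  -- map to the listed columns. -caster ~ Storage Convertor, -noWAL ~ Disable Write
  -- Ahead Log (use with extreme caution).
  STORE ratings INTO 'hbase://movie'
      USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
          'info:name info:rating',
          '-caster HBaseBinaryConverter -noWAL true');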


B.5 LKM Hive to Pig

This KM loads data from a Hive table into Pig using HCatalog.

The following table describes the options for LKM Hive to Pig.

Table B-5 LKM Hive to Pig

Option Description

Storage Function

The storage function to be used to load data. This is a read-only option, which cannot be changed.

HCatLoader is used to load data from a Hive table.
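
For illustration only, the equivalent hand-written Pig Latin load is a single statement. The Hive database and table names are hypothetical, and the HCatLoader package name depends on the HCatalog version in use:

  -- Loads the Hive table default.movie through HCatalog; the schema comes from the Hive metastore.
  movie = LOAD 'default.movie' USING org.apache.hive.hcatalog.pig.HCatLoader();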


B.6 LKM Pig to Hive

This KM stores data into a Hive table using HCatalog.

The following table describes the options for LKM Pig to Hive.

Table B-6 LKM Pig to Hive

Option Description

Storage Function

The storage function to be used to store data. This is a read-only option, which cannot be changed.

HCatStorer is used to store data into a Hive table.

Partition

The new partition to be created.

Represents the key/value pairs for the partition. This is a mandatory argument when you are writing to a partitioned table and the partition column is not in the output columns. The values for the partition keys must not be quoted.
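
For illustration only, a hand-written Pig Latin store through HCatStorer, including an unquoted partition specification as described for the Partition option, might look like the following. The database, table, and partition names are hypothetical, and the HCatStorer package name depends on the HCatalog version in use:

  -- Writes into the partition month=12, day=25 of the Hive table default.movie_archive.
  -- Note that the partition key values are not quoted inside the specification string.
  STORE movie INTO 'default.movie_archive'
      USING org.apache.hive.hcatalog.pig.HCatStorer('month=12,day=25');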


B.7 LKM SQL to Pig SQOOP

This KM integrates data from a JDBC data source into Pig.

It executes the following steps:

  1. Creates a Sqoop configuration file, which contains the upstream query.

  2. Executes Sqoop to extract the source data and import it into a staging file in CSV format.

  3. Runs the LKM File to Pig KM to load the staging file into Pig (a sketch of this load appears after this list).

  4. Drops the staging file.
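
For illustration only, step 3 conceptually amounts to a delimited load of the Sqoop staging file, using the delimiter configured by STAGING_FILE_DELIMITER (tab by default). The staging path and column names below are hypothetical:

  -- Hypothetical staging directory produced by Sqoop in step 2.
  staging = LOAD '/tmp/odi_sqoop_staging' USING PigStorage('\t')
            AS (CUST_ID:int, CUST_NAME:chararray);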

The following table describes the options for LKM SQL to Pig SQOOP.

Table B-7 LKM SQL to Pig SQOOP

Option Description

STAGING_FILE_DELIMITER

Sqoop uses this delimiter to create the temporary file. If not specified, \t will be used.

Storage Function

The storage function to be used to load data.

Select the storage function to be used to load data.

Schema for Complex Fields

The Pig schema for simple/complex fields separated by comma (,).

Redefine the data types of the fields in Pig schema format. This option primarily allows you to override the default data type conversion for data store attributes, for example: PO_NO:int,PO_TOTAL:long,MOVIE_RATING:{(RATING:double,INFO:chararray)}. The names of the fields defined here must match the attribute names of the data store.

Function Class

Fully qualified name of the class to be used as the storage function to load data.

Specify the fully qualified name of the class to be used as the storage function to load data.

Function Parameters

The parameters required for the custom function.

Specify the parameters that the loader function expects.

For example, the XMLLoader function may look like XMLLoader('MusicStore', 'movie', 'id:double, name:chararray, director:chararray', options)

Here the first three arguments are parameters, which can be specified as -rootElement MusicStore -tableName movie -schema

where:

MusicStore - the root element of the XML.

movie - the element that wraps the child elements such as id, name, and so on.

The third argument is the representation of the data in Pig schema format.

The names of the parameters are arbitrary and there can be any number of parameters.

Options

Additional options required for the storage function.

Specify additional options required for the storage function.

For example, the XMLLoader function may look like XMLLoader('MusicStore', 'movie', 'id:double, name:chararray, director:chararray', options)

The last argument options can be specified as -namespace com.imdb -encoding utf8

Jars

The jar containing the storage function class and dependent libraries separated by colon (:).

Specify the jar containing the storage function class and dependent libraries separated by colon (:).

Storage Convertor

The converter that provides functions to cast from bytearray to each of Pig's internal types.

Specify the converter that provides functions to cast from bytearray to each of Pig's internal types.

The supported converter is Utf8StorageConverter.


B.8 XKM Pig Aggregate

Summarize rows, for example using SUM and GROUP BY.

The following table describes the options for XKM Pig Aggregate.

Table B-8 XKM Pig Aggregate

Option Description

USING_ALGORITHM

Aggregation type: collected or merge.

PARTITION_BY

Specify the Hadoop partitioner.

PARTITIONER_JAR

The jar containing the custom partitioner class specified in the PARTITION_BY option.

PARALLEL_NUMBER

Increase the parallelism of this job.
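
For illustration only, an aggregation in Pig Latin is a GROUP followed by a FOREACH with aggregate functions; the USING_ALGORITHM and PARALLEL_NUMBER options roughly correspond to the USING and PARALLEL clauses of GROUP. The aliases and columns are hypothetical:

  -- USING 'collected' or 'merge' (placed before PARALLEL) would correspond to USING_ALGORITHM;
  -- PARALLEL sets the number of reduce tasks.
  grp    = GROUP orders BY PO_NO PARALLEL 4;
  totals = FOREACH grp GENERATE group AS PO_NO, SUM(orders.PO_TOTAL) AS PO_TOTAL_SUM;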


B.9 XKM Pig Distinct

Eliminates duplicates in data.
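
In Pig Latin, duplicate elimination is expressed with the DISTINCT operator; a minimal sketch with a hypothetical alias:

  -- Removes duplicate tuples from the relation.
  deduped = DISTINCT movies;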

B.10 XKM Pig Expression

Define expressions to be reused across a single mapping.

B.11 XKM Pig Filter

Produce a subset of data by a filter condition.
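
In Pig Latin, a filter condition is expressed with the FILTER operator; a minimal sketch with a hypothetical alias and column:

  -- Keeps only the tuples whose total is at least 1000.
  large_orders = FILTER orders BY PO_TOTAL >= 1000;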

B.12 XKM Pig Flatten

Un-nest the complex data according to the given options.

The following table describes the options for XKM Pig Flatten.

Table B-9 XKM Pig Flatten

Option Description

Default Expression

Default expression for null nested table objects, e.g. rating_table(obj_rating('-1', 'Unknown')).

This is used to return a row with default values for each null nested table object.
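
In Pig Latin, un-nesting is expressed with FLATTEN inside a FOREACH; a minimal sketch reusing the hypothetical MOVIE_RATING bag from the Schema for Complex Fields example:

  -- Produces one output tuple per element of the MOVIE_RATING bag.
  flat = FOREACH movies GENERATE PO_NO, FLATTEN(MOVIE_RATING);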


B.13 XKM Pig Join

Joins two or more input sources based on the join condition.

The following table describes the options for XKM Pig Join.

Table B-10 XKM Pig Join

Option Description

USING_ALGORITHM

Join type: replicated, skewed, or merge.

PARTITION_BY

Specify the Hadoop partitioner.

PARTITIONER_JAR

The jar containing the custom partitioner class specified in the PARTITION_BY option.

PARALLEL_NUMBER

Increase the parallelism of this job.
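
For illustration only, a Pig Latin join with the USING and PARALLEL clauses that correspond to USING_ALGORITHM and PARALLEL_NUMBER might look like the following. The aliases and keys are hypothetical:

  -- Replicated join: the second relation must be small enough to fit in memory.
  joined = JOIN orders BY CUST_ID, customers BY CUST_ID USING 'replicated' PARALLEL 4;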


B.14 XKM Pig Lookup

Lookup data for a driving data source.

The following table describes the options for XKM Pig Lookup.

Table B-11 XKM Pig Lookup

Option Description

Jars

The jar containing the User Defined Function classes and dependent libraries separated by colon (:).
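
Assuming the lookup is realized as an outer join in the generated Pig Latin (an assumption; the aliases and keys below are hypothetical), a minimal sketch looks like this:

  -- Keeps every driving row and attaches the matching lookup row where one exists.
  with_lookup = JOIN orders BY CUST_ID LEFT OUTER, customers BY CUST_ID;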


B.15 XKM Pig Pivot

Takes data in separate rows, aggregates it, and converts it into columns.

B.16 XKM Pig Set

Perform UNION, MINUS or other set operations.
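
In Pig Latin, a union is a single statement; a minimal sketch with hypothetical aliases (ONSCHEMA aligns the relations by field name):

  -- Combines the tuples of both relations into one.
  combined = UNION ONSCHEMA sales_2014, sales_2015;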

B.17 XKM Pig Sort

Sort data using an expression.
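
In Pig Latin, sorting is expressed with the ORDER BY operator; a minimal sketch with a hypothetical alias and columns:

  -- Sorts by name ascending, then by total descending.
  sorted = ORDER movies BY MOVIE_NAME ASC, PO_TOTAL DESC;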

B.18 XKM Pig Split

Split data into multiple paths with multiple conditions.
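
In Pig Latin, this corresponds to the SPLIT operator; a minimal sketch with a hypothetical alias, column, and conditions:

  -- Routes each tuple into one of two relations based on its total.
  SPLIT orders INTO large IF PO_TOTAL >= 1000, other OTHERWISE;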

B.19 XKM Pig Subquery Filter

Filter rows based on the results of a subquery.

B.20 XKM Pig Table Function

Pig table function access.

The following table describes the options for XKM Pig Table Function.

Table B-12 XKM Pig Table Function

Option Description

PIG_SCRIPT_CONTENT

User-specified Pig script content.


B.21 XKM Pig Unpivot

Transform a single row of attributes into multiple rows in an efficient manner.