C Spark Knowledge Modules

This appendix provides information about the Spark knowledge modules.

This appendix includes the following sections:

C.1 LKM File to Spark
C.2 LKM Spark to File
C.3 LKM Hive to Spark
C.4 LKM Spark to Hive
C.5 XKM Spark Aggregate
C.6 XKM Spark Distinct
C.7 XKM Spark Expression
C.8 XKM Spark Filter
C.9 XKM Spark Flatten
C.10 XKM Spark Join
C.11 XKM Spark Lookup
C.12 XKM Spark Pivot
C.13 XKM Spark Set
C.14 XKM Spark Sort
C.15 XKM Spark Split
C.16 XKM Spark Table Function
C.17 IKM Spark Table Function
C.18 XKM Spark Unpivot

C.1 LKM File to Spark

This KM loads data from a file into a Spark Python variable. It can be defined on the access point (AP) between the execution units, with File as the source technology and Spark Python as the target technology.

The following table describes the options for LKM File to Spark.

Table C-1 LKM File to Spark

Option Description

Storage Function

The storage function to be used to load/store data.

CACHE_DATA

Persist the data with the default storage level.

InputFormatClass

Classname of Hadoop InputFormat.

For example, org.apache.hadoop.mapreduce.lib.input.TextInputFormat.

KeyClass

Fully qualified classname of key Writable class.

For example, org.apache.hadoop.io.Text.

ValueClass

Fully qualified classname of value Writable class.

For example, org.apache.hadoop.io.LongWritable.

KeyConverter

Fully qualified classname of key converter class.

ValueConverter

Fully qualified classname of value converter class.

Job Configuration

Hadoop configuration.

For example, {'hbase.zookeeper.quorum': 'HOST', 'hbase.mapreduce.inputtable': 'TAB'}
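
As an illustration only (not the code that the KM generates), these options map closely to the parameters of PySpark's SparkContext.newAPIHadoopFile() call. The sketch below assumes placeholder values for the file path, application name, and job configuration.

from pyspark import SparkContext

sc = SparkContext(appName="lkm_file_to_spark_example")

# Read a file into an RDD through an explicit Hadoop InputFormat with
# key/value Writable classes and a job configuration (conf).
# For TextInputFormat the key is the byte offset and the value is the line.
source_rdd = sc.newAPIHadoopFile(
    "/user/odi/source_data",   # placeholder path
    inputFormatClass="org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
    keyClass="org.apache.hadoop.io.LongWritable",
    valueClass="org.apache.hadoop.io.Text",
    conf={"textinputformat.record.delimiter": "\n"},   # Job Configuration option
)

# CACHE_DATA=TRUE corresponds to persisting with the default storage level.
source_rdd.cache()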


C.2 LKM Spark to File

This KM stores data into a file from a Spark Python variable. It can be defined on the AP between the execution units, with Spark Python as the source technology and File as the target technology.

The following table describes the options for LKM Spark to File.

Table C-2 LKM Spark to File

Option Description

Storage Function

The storage function to be used to load/store data.

InputFormatClass

Classname of Hadoop InputFormat.

For example, org.apache.hadoop.mapreduce.lib.input.TextInputFormat.

KeyClass

Fully qualified classname of key Writable class.

For example, org.apache.hadoop.io.Text.

ValueClass

Fully qualified classname of value Writable class.

For example, org.apache.hadoop.io.LongWritable.

KeyConverter

Fully qualified classname of key converter class.

ValueConverter

Fully qualified classname of value converter class.

Job Configuration

Hadoop configuration.

For example, {'hbase.zookeeper.quorum': 'HOST', 'hbase.mapreduce.inputtable': 'TAB'}
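
As a hedged illustration of what storing a Spark Python variable into a file can look like (not the code the KM generates), the sketch below uses saveAsTextFile() and saveAsNewAPIHadoopFile(); the paths, sample data, and output format choice are assumptions.

from pyspark import SparkContext

sc = SparkContext(appName="lkm_spark_to_file_example")

# A small key/value RDD standing in for the Spark Python variable to be stored.
result_rdd = sc.parallelize([(1, "a"), (2, "b")])

# Simplest storage function: plain text output.
result_rdd.saveAsTextFile("/user/odi/target_text")   # placeholder path

# Alternatively, write through an explicit Hadoop OutputFormat with
# key/value classes, mirroring the options described above.
result_rdd.saveAsNewAPIHadoopFile(
    "/user/odi/target_seq",   # placeholder path
    outputFormatClass="org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat",
    keyClass="org.apache.hadoop.io.IntWritable",
    valueClass="org.apache.hadoop.io.Text",
)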


C.3 LKM Hive to Spark

This KM loads data from a Hive table into a Spark Python variable. It can be defined on the AP between the execution units, with Hive as the source technology and Spark Python as the target technology.
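
A minimal sketch of reading a Hive table into a Spark Python variable, assuming a Hive-enabled SparkSession and a placeholder table name; it illustrates the idea rather than the exact code the KM generates.

from pyspark.sql import SparkSession

# A Hive-enabled session; in an ODI-generated script the Spark
# context/session is set up by the Spark execution unit.
spark = SparkSession.builder \
    .appName("lkm_hive_to_spark_example") \
    .enableHiveSupport() \
    .getOrCreate()

# Load a Hive table into a Spark variable (a DataFrame, or its RDD).
src_df = spark.table("default.src_customers")   # placeholder Hive table
src_rdd = src_df.rdd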

C.4 LKM Spark to Hive

This KM stores data into a Hive table from a Spark Python variable. It can be defined on the AP between the execution units, with Spark Python as the source technology and Hive as the target technology.

The following table describes the options for LKM Spark to Hive.

Table C-3 LKM Spark to Hive

Option Description

CREATE_TARGET_TABLE

Create the target table.

OVERWRITE_TARGET_TABLE

Overwrite the target table.
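
A minimal sketch of the behavior these options control, assuming a Hive-enabled SparkSession, sample data, and a placeholder target table; mode("overwrite") illustrates OVERWRITE_TARGET_TABLE, and saveAsTable() creates the table if it does not exist yet (CREATE_TARGET_TABLE).

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("lkm_spark_to_hive_example") \
    .enableHiveSupport() \
    .getOrCreate()

# Data standing in for the Spark Python variable produced upstream.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "code"])

# Create the Hive table if needed and replace its contents instead of appending.
df.write.mode("overwrite").saveAsTable("default.tgt_codes")   # placeholder table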


C.5 XKM Spark Aggregate

Summarize rows, for example, using SUM and GROUP BY.

The following table describes the options for XKM Spark Aggregate.

Table C-4 XKM Spark Aggregate

Option Description

CACHE_DATA

Persist the data with the default storage level.

NUMBER_OF_TASKS

The number of tasks to use.
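
A minimal PySpark sketch of a SUM ... GROUP BY over sample data; the numPartitions argument is assumed to play the role of NUMBER_OF_TASKS, and cache() corresponds to CACHE_DATA=TRUE.

from pyspark import SparkContext

sc = SparkContext(appName="xkm_spark_aggregate_example")

# (customer, amount) pairs standing in for the upstream dataset.
rows = sc.parallelize([("c1", 10.0), ("c2", 5.0), ("c1", 2.5)])

# SUM(amount) GROUP BY customer; the second argument sets the number of
# reduce partitions (the NUMBER_OF_TASKS option, as assumed here).
totals = rows.reduceByKey(lambda a, b: a + b, 4)

# CACHE_DATA=TRUE: persist with the default storage level.
totals.cache()
print(totals.collect())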


C.6 XKM Spark Distinct

Eliminates duplicates in data.

The following table describes the options for XKM Spark Distinct.

Table C-5 XKM Spark Distinct

Option Description

CACHE_DATA

Persist the data with the default storage level.
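
A minimal PySpark sketch of duplicate elimination over sample data; cache() stands in for CACHE_DATA=TRUE.

from pyspark import SparkContext

sc = SparkContext(appName="xkm_spark_distinct_example")

rows = sc.parallelize([("c1", 10.0), ("c1", 10.0), ("c2", 5.0)])

# Remove duplicate rows and persist the result with the default storage level.
deduped = rows.distinct()
deduped.cache()
print(deduped.collect())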


C.7 XKM Spark Expression

Define expressions to be reused across a single mapping.
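
For illustration, a sketch of defining an expression once in PySpark and reusing it in several steps of the same mapping; the data and the expression itself are made-up assumptions.

from pyspark import SparkContext

sc = SparkContext(appName="xkm_spark_expression_example")

rows = sc.parallelize([("c1", 10.0), ("c2", 5.0)])

# A single expression defined once and reused in several downstream steps.
def net_amount(row):
    customer, amount = row
    return (customer, amount * 0.9)   # illustrative expression

discounted = rows.map(net_amount)
large_only = rows.map(net_amount).filter(lambda r: r[1] > 5.0)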

C.8 XKM Spark Filter

Produce a subset of data by a filter condition.

The following table describes the options for XKM Spark Filter.

Table C-6 XKM Spark Filter

Option Description

CACHE_DATA

Persist the data with the default storage level.
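
A minimal PySpark sketch of filtering sample data; cache() stands in for CACHE_DATA=TRUE.

from pyspark import SparkContext

sc = SparkContext(appName="xkm_spark_filter_example")

rows = sc.parallelize([("c1", 10.0), ("c2", 5.0), ("c3", 7.5)])

# Keep only the rows matching the filter condition.
filtered = rows.filter(lambda r: r[1] >= 7.0)
filtered.cache()
print(filtered.collect())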


C.9 XKM Spark Flatten

Un-nests complex (nested) data according to the given options.

The following table describes the options for XKM Spark Flatten.

Table C-7 XKM Spark Flatten

Option Description

Default Expression

Default expression for null nested table objects. For example, rating_table(obj_rating('-1', 'Unknown')).

This is used to return a row with default values for each null nested table object.

CACHE_DATA

When set to TRUE, persists the results with the Spark default storage level.

The default is FALSE.
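
A minimal PySpark sketch of un-nesting with a default row substituted for null nested objects (the role of the Default Expression option); the sample data and default values are assumptions.

from pyspark import SparkContext

sc = SparkContext(appName="xkm_spark_flatten_example")

# Each row carries a nested list of (rating, label) pairs; one row has None.
rows = sc.parallelize([
    ("movie1", [("5", "Great"), ("3", "OK")]),
    ("movie2", None),
])

DEFAULT_RATING = [("-1", "Unknown")]   # plays the role of the Default Expression

# Un-nest: one output row per nested element, falling back to the default
# row when the nested table object is null.
flattened = rows.flatMap(
    lambda r: [(r[0], rating, label) for rating, label in (r[1] or DEFAULT_RATING)]
)
print(flattened.collect())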


C.10 XKM Spark Join

Joins two or more input sources based on the join condition.

The following table describes the options for XKM Spark Join.

Table C-8 XKM Spark Join

Option Description

CACHE_DATA

Persist the data with the default storage level.

NUMBER_OF_TASKS

The number of tasks to use.
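
A minimal PySpark sketch of a key-based join over sample data; the numPartitions argument is assumed to correspond to NUMBER_OF_TASKS and cache() to CACHE_DATA=TRUE.

from pyspark import SparkContext

sc = SparkContext(appName="xkm_spark_join_example")

orders = sc.parallelize([("c1", 10.0), ("c2", 5.0)])
customers = sc.parallelize([("c1", "Alice"), ("c2", "Bob")])

# Join on the key; the second argument sets the number of partitions.
joined = orders.join(customers, 4)
joined.cache()
print(joined.collect())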


C.11 XKM Spark Lookup

Lookup data for a driving data source.

The following table describes the options for XKM Spark Lookup.

Table C-9 XKM Spark Lookup

Option Description

CACHE_DATA

Persist the data with the default storage level.

NUMBER_OF_TASKS

The number of tasks to use.
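
A minimal PySpark sketch of a lookup implemented as a left outer join from the driving source to the lookup source; the sample data is made up, and the numPartitions argument is assumed to correspond to NUMBER_OF_TASKS.

from pyspark import SparkContext

sc = SparkContext(appName="xkm_spark_lookup_example")

driving = sc.parallelize([("c1", 10.0), ("c3", 7.5)])
lookup = sc.parallelize([("c1", "Alice"), ("c2", "Bob")])

# Look up the name for every driving row; unmatched keys yield None.
enriched = driving.leftOuterJoin(lookup, 4)
enriched.cache()   # CACHE_DATA=TRUE
print(enriched.collect())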


C.12 XKM Spark Pivot

Takes data in separate rows, aggregates it, and converts it into columns.

The following table describes the options for XKM Spark Pivot.

Table C-10 XKM Spark Pivot

Option Description

CACHE_DATA

Persist the data with the default storage level.
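
A minimal sketch using the DataFrame pivot API over sample data; the column names and pivot values are assumptions.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("xkm_spark_pivot_example").getOrCreate()

sales = spark.createDataFrame(
    [("c1", "Q1", 10.0), ("c1", "Q2", 5.0), ("c2", "Q1", 7.5)],
    ["customer", "quarter", "amount"],
)

# Rows per (customer, quarter) become one row per customer with one
# column per quarter, aggregating the amounts.
pivoted = sales.groupBy("customer").pivot("quarter", ["Q1", "Q2"]).agg(F.sum("amount"))
pivoted.show()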


C.13 XKM Spark Set

Perform UNION, MINUS or other set operations.
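
A minimal PySpark sketch of the common set operations over sample data:

from pyspark import SparkContext

sc = SparkContext(appName="xkm_spark_set_example")

a = sc.parallelize([1, 2, 3, 4])
b = sc.parallelize([3, 4, 5])

union_ab = a.union(b).distinct()   # UNION (distinct)
minus_ab = a.subtract(b)           # MINUS
intersect_ab = a.intersection(b)   # INTERSECT
print(union_ab.collect(), minus_ab.collect(), intersect_ab.collect())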

C.14 XKM Spark Sort

Sort data using an expression.

The following table describes the options for XKM Spark Sort.

Table C-11 XKM Spark Sort

Option Description

CACHE_DATA

Persist the data with the default storage level.

NUMBER_OF_TASKS

The number of tasks to use.
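
A minimal PySpark sketch of sorting by an expression over sample data; numPartitions is assumed to correspond to NUMBER_OF_TASKS and cache() to CACHE_DATA=TRUE.

from pyspark import SparkContext

sc = SparkContext(appName="xkm_spark_sort_example")

rows = sc.parallelize([("c2", 5.0), ("c1", 10.0), ("c3", 7.5)])

# Sort by an expression over each row (here the amount, descending).
sorted_rows = rows.sortBy(lambda r: r[1], ascending=False, numPartitions=4)
sorted_rows.cache()
print(sorted_rows.collect())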


C.15 XKM Spark Split

Split data into multiple paths with multiple conditions.

The following table describes the options for XKM Spark Split.

Table C-12 XKM Spark Split

Option Description

CACHE_DATA

Persist the data with the default storage level.
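
A minimal PySpark sketch of splitting one input into several paths, each driven by its own condition; the sample data and thresholds are assumptions.

from pyspark import SparkContext

sc = SparkContext(appName="xkm_spark_split_example")

rows = sc.parallelize([("c1", 10.0), ("c2", 5.0), ("c3", 7.5)])

# Caching the shared input (CACHE_DATA=TRUE) avoids recomputing it per path.
rows.cache()

# Each output path is driven by its own condition over the same input.
high = rows.filter(lambda r: r[1] >= 8.0)
medium = rows.filter(lambda r: 6.0 <= r[1] < 8.0)
low = rows.filter(lambda r: r[1] < 6.0)
print(high.collect(), medium.collect(), low.collect())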


C.16 XKM Spark Table Function

Spark table function access.

The following table describes the options for XKM Spark Table Function.

Table C-13 XKM Spark Table Function

Option Description

SPARK_SCRIPT_FILE

The user-specified path of the Spark script file.

CACHE_DATA

Persist the data with the default storage level.
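
A heavily hedged sketch of the idea: SPARK_SCRIPT_FILE points to a user-provided Python script, shown here inline as a hypothetical function that takes an RDD and returns a transformed RDD; the function name and contract are assumptions for illustration, not the KM's actual interface.

from pyspark import SparkContext

sc = SparkContext(appName="xkm_spark_table_function_example")

rows = sc.parallelize([("c1", 10.0), ("c2", 5.0)])

# Hypothetical user logic of the kind a Spark script file could contain:
# a function that accepts an RDD and returns a transformed RDD.
def user_table_function(rdd):
    return rdd.mapValues(lambda amount: amount * 2)

transformed = user_table_function(rows)
transformed.cache()   # CACHE_DATA=TRUE
print(transformed.collect())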


C.17 IKM Spark Table Function

Spark table function as target.

The following table describes the options for IKM Spark Table Function.

Table C-14 IKM Spark Table Function

Option Description

SPARK_SCRIPT_FILE

The user-specified path of the Spark script file.

CACHE_DATA

Persist the data with the default storage level.


C.18 XKM Spark Unpivot

Transform a single row of attributes into multiple rows in an efficient manner.

The following table describes the options for XKM Spark Unpivot.

Table C-15 XKM Spark Unpivot

Option Description

CACHE_DATA

Persist the data with the default storage level.
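
A minimal sketch of unpivoting with Spark SQL's stack() function over sample data; the column names and values are assumptions.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("xkm_spark_unpivot_example").getOrCreate()

pivoted = spark.createDataFrame(
    [("c1", 10.0, 5.0), ("c2", 7.5, 3.0)],
    ["customer", "Q1", "Q2"],
)

# One input row with Q1/Q2 columns becomes one output row per quarter.
unpivoted = pivoted.selectExpr(
    "customer",
    "stack(2, 'Q1', Q1, 'Q2', Q2) as (quarter, amount)",
)
unpivoted.show()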