C Spark Knowledge Modules

This appendix provides information about the Spark knowledge modules.

C.1 LKM File to Spark

This KM loads data from a file into a Spark Python variable. It can be defined on the AP between the execution units, with source technology File and target technology Spark Python.

The following tables describe the options for LKM File to Spark.


Table C-1 LKM File to Spark

Storage Function
  The storage function to be used to load or store data.

CACHE_DATA
  Persist the data with the default storage level.

InputFormatClass
  Class name of the Hadoop InputFormat. For example, org.apache.hadoop.mapreduce.lib.input.TextInputFormat.

KeyClass
  Fully qualified class name of the key Writable class. For example, org.apache.hadoop.io.Text.

ValueClass
  Fully qualified class name of the value Writable class. For example, org.apache.hadoop.io.LongWritable.

KeyConverter
  Fully qualified class name of the key converter class.

ValueConverter
  Fully qualified class name of the value converter class.

Job Configuration
  Hadoop configuration properties. For example, {'hbase.zookeeper.quorum': 'HOST', 'hbase.mapreduce.inputtable': 'TAB'}.

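These option values map onto the arguments of PySpark's SparkContext.newAPIHadoopFile() call. A minimal sketch of the kind of load code the options describe (the path and configuration values are placeholders, not necessarily what the KM generates):

    from pyspark import SparkContext

    sc = SparkContext(appName="odi_file_to_spark")

    # Read a file as an RDD of (key, value) pairs through the Hadoop new API.
    # With TextInputFormat the key is the byte offset (LongWritable) and the
    # value is the line content (Text).
    rdd = sc.newAPIHadoopFile(
        "/user/odi/source_dir",  # placeholder path
        inputFormatClass="org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
        keyClass="org.apache.hadoop.io.LongWritable",
        valueClass="org.apache.hadoop.io.Text",
        conf={"hbase.zookeeper.quorum": "HOST"})  # Job Configuration

    rdd.cache()  # CACHE_DATA: persist with the default storage level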

This LKM uses the StreamingContext.textFileStream() method to transfer file content as a data stream. The directory is monitored while the Spark application is running; any files copied into this directory from other locations are detected.

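A minimal sketch of this streaming pattern (the directory path and batch interval are placeholders):

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="odi_file_stream")
    ssc = StreamingContext(sc, batchDuration=10)  # 10-second batch interval

    # Monitor the directory; every file copied into it while the application
    # is running becomes part of the stream.
    lines = ssc.textFileStream("/user/odi/stream_dir")  # placeholder path
    lines.pprint()

    ssc.start()
    ssc.awaitTermination()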

Table C-2 LKM File to Spark for Streaming

STREAMING_MODE
  Indicates whether the mapping is executed in streaming mode. Default is FALSE.

Storage Function
  The storage function to be used to load data. If STREAMING_MODE is set to TRUE, the load function is changed to textFileStream automatically. Default is textFile.

Source Data Store
  The source data store is a directory, and a field separator should be defined.


C.2 LKM Spark to File

This KM stores data from a Spark Python variable into a file. It can be defined on the AP between the execution units, with source technology Spark Python and target technology File.

The following tables describe the options for LKM Spark to File.


Table C-3 LKM Spark to File

Storage Function
  The storage function to be used to load or store data.

OutputFormatClass
  Class name of the Hadoop OutputFormat. For example, org.apache.hadoop.mapreduce.lib.output.TextOutputFormat.

KeyClass
  Fully qualified class name of the key Writable class. For example, org.apache.hadoop.io.Text.

ValueClass
  Fully qualified class name of the value Writable class. For example, org.apache.hadoop.io.LongWritable.

KeyConverter
  Fully qualified class name of the key converter class.

ValueConverter
  Fully qualified class name of the value converter class.

Job Configuration
  Hadoop configuration properties. For example, {'hbase.zookeeper.quorum': 'HOST', 'hbase.mapreduce.inputtable': 'TAB'}.

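These options correspond to the arguments of PySpark's RDD.saveAsNewAPIHadoopFile() call. A minimal illustrative sketch, assuming a text output (the path, data, and Writable classes are placeholders):

    from pyspark import SparkContext

    sc = SparkContext(appName="odi_spark_to_file")

    # Store a pair RDD through the Hadoop new API; PySpark converts the
    # Python key/value objects to the given Writable classes.
    pairs = sc.parallelize([(1, "a"), (2, "b")])
    pairs.saveAsNewAPIHadoopFile(
        "/user/odi/target_dir",  # placeholder path
        outputFormatClass="org.apache.hadoop.mapreduce.lib.output.TextOutputFormat",
        keyClass="org.apache.hadoop.io.IntWritable",
        valueClass="org.apache.hadoop.io.Text")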


Table C-4 LKM Spark to File for Streaming

Storage Function
  If STREAMING_MODE is set to TRUE, the load function is changed to textFileStream automatically. Default is textFile.


C.3 LKM Hive to Spark

This KM loads data from a Hive table into a Spark Python variable. It can be defined on the AP between the execution units, with source technology Hive and target technology Spark Python.

C.4 LKM Spark to Hive

This KM stores data from a Spark Python variable into a Hive table. It can be defined on the AP between the execution units, with source technology Spark and target technology Hive.

The following table describes the options for LKM Spark to Hive.


Table C-5 LKM Spark to Hive

CREATE_TARGET_TABLE
  Create the target table.

OVERWRITE_TARGET_TABLE
  Overwrite the target table.

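A minimal sketch of what these two options control, expressed with the PySpark DataFrame API (the table and application names are placeholders; the code the KM actually generates may differ):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("odi_spark_to_hive")
             .enableHiveSupport()
             .getOrCreate())

    df = spark.createDataFrame([(1, "a")], ["id", "val"])

    # OVERWRITE_TARGET_TABLE corresponds to mode("overwrite"); saveAsTable
    # creates the target table when it does not exist (CREATE_TARGET_TABLE).
    df.write.mode("overwrite").saveAsTable("target_db.target_table")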

C.5 LKM HDFS to Spark

This KM loads data from an HDFS file into Spark.


Table C-6 LKM HDFS to Spark

Storage Function
  The storage function used to load or store data.

streamingContext
  Name of the streaming context.

InputFormatClass
  Class name of the Hadoop InputFormat. For example, org.apache.hadoop.mapreduce.lib.input.TextInputFormat.

KeyClass
  Fully qualified class name of the key Writable class. For example, org.apache.hadoop.io.Text.

ValueClass
  Fully qualified class name of the value Writable class. For example, org.apache.hadoop.io.LongWritable.

KeyConverter
  Fully qualified class name of the key converter class.

ValueConverter
  Fully qualified class name of the value converter class.

Job Configuration
  Hadoop configuration properties. For example, {'hbase.zookeeper.quorum': 'HOST', 'hbase.mapreduce.inputtable': 'TAB'}.

Delete Spark Mapping Files
  Delete temporary objects at the end of the mapping.

Cache
  Cache the RDD across operations after computation.

Storage Level
  The storage level used to cache data.

Repartition
  Repartition the RDD after the transformation of this component.

Level of Parallelism
  Number of partitions.

Sort Partitions
  Sort partitions by a key function when repartitioning the RDD.

Partition Sort Order
  Sort order of the partitions.

Partition Key Function
  Defines the keys of the partitions.

Partition Function
  Customized partitioning function.

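The repartitioning options correspond to PySpark's repartition-and-sort primitives. A minimal sketch of the pattern that Repartition, Level of Parallelism, Sort Partitions, Partition Sort Order, Partition Key Function, and Partition Function describe (the path and key function are placeholders):

    from pyspark import SparkContext

    sc = SparkContext(appName="odi_hdfs_to_spark")
    rdd = sc.textFile("hdfs:///user/odi/input")  # placeholder path

    # Partition Key Function: derive the partitioning key from each record.
    pairs = rdd.map(lambda line: (line.split(",")[0], line))

    result = pairs.repartitionAndSortWithinPartitions(
        numPartitions=8,        # Level of Parallelism
        partitionFunc=hash,     # Partition Function
        ascending=True,         # Partition Sort Order
        keyfunc=lambda k: k)    # Sort Partitions key function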

C.6 LKM Spark to HDFS

This KM loads data from Spark into an HDFS file.


Table C-7 LKM Spark to HDFS

Storage Function
  The storage function used to load or store data.

OutputFormatClass
  Class name of the Hadoop OutputFormat. For example, org.apache.hadoop.mapreduce.lib.output.TextOutputFormat.

KeyClass
  Fully qualified class name of the key Writable class. For example, org.apache.hadoop.io.Text.

ValueClass
  Fully qualified class name of the value Writable class. For example, org.apache.hadoop.io.LongWritable.

KeyConverter
  Fully qualified class name of the key converter class.

ValueConverter
  Fully qualified class name of the value converter class.

Job Configuration
  Hadoop configuration properties. For example, {'hbase.zookeeper.quorum': 'HOST', 'hbase.mapreduce.inputtable': 'TAB'}.

DELETE_TEMPORARY_OBJECTS
  Delete temporary objects at the end of the mapping.

Delete Spark Mapping Files
  Delete temporary objects at the end of the mapping.

Cache
  Cache the RDD across operations after computation.

Storage Level
  The storage level used to cache data.

Repartition
  Repartition the RDD after the transformation of this component.

Level of Parallelism
  Number of partitions.

Sort Partitions
  Sort partitions by a key function when repartitioning the RDD.

Partition Sort Order
  Sort order of the partitions.

Partition Key Function
  Defines the keys of the partitions.

Partition Function
  Customized partitioning function.


C.7 LKM Kafka to Spark

This KM loads data from a Kafka source into a Spark target. It can be defined on an AP node that exists in a Spark execution unit and has a Kafka upstream node.


Table C-8 LKM Kafka to Spark for Streaming

Storage Function
  The storage function to be used to load data. Default is createStream.

Key Decoder
  Decodes the message key. Default is empty.

Value Decoder
  Decodes the message value. Default is empty.

Group Id
  Receiver group id parameter used to call KafkaUtils.createStream.
  Note: In a group of receivers (all receivers having the same Group Id), every message is received by a single receiver only.

Kafka Params
  Parameters used to call KafkaUtils.createStream.

storageLevel
  Storage level used for storing the received objects. Default is StorageLevel.MEMORY_AND_DISK_2.

Number of Partitions
  Number of Kafka partitions from which each thread consumes data.

From Offsets
  Parameter used in conjunction with the createDirectStream function.

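The receiver-based options map onto the arguments of KafkaUtils.createStream from the pyspark.streaming.kafka module (available in the Spark releases this KM targets). A minimal sketch with placeholder hosts, group, and topic:

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName="odi_kafka_to_spark")
    ssc = StreamingContext(sc, batchDuration=5)

    # Group Id, Kafka Params, and storageLevel map onto the createStream
    # arguments; Key Decoder and Value Decoder default to UTF-8 decoding.
    stream = KafkaUtils.createStream(
        ssc,
        zkQuorum="zkhost:2181",        # placeholder ZooKeeper quorum
        groupId="odi_consumer_group",  # Group Id
        topics={"my_topic": 1})        # topic -> number of consumer threads

    stream.pprint()
    ssc.start()
    ssc.awaitTermination()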

C.8 LKM Spark to Kafka

LKM Spark to Kafka works in both streaming and batch mode. It can be defined on the AP between the execution units and requires a Kafka downstream node.


Table C-9 LKM Spark to Kafka

value.serializer
  Serializer class for message values. For example, org.apache.kafka.common.serialization.StringSerializer.

C.9 LKM SQL to Spark

This KM is designed to load data from Cassandra into Spark, but it also works with other JDBC sources. It can be defined on an AP node that has a SQL source and a Spark target.


Table C-10 LKM SQL to Spark

PARTITION_COLUMN
  Column used for partitioning.

LOWER_BOUND
  Lower bound of the partition column.

UPPER_BOUND
  Upper bound of the partition column.

NUMBER_PARTITIONS
  Number of partitions.

PREDICATES
  List of predicates.
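
These options map onto the partitioned-read arguments of PySpark's DataFrameReader.jdbc() method. A minimal sketch with placeholder connection details:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("odi_sql_to_spark").getOrCreate()

    df = spark.read.jdbc(
        url="jdbc:oracle:thin:@//dbhost:1521/svc",  # placeholder JDBC URL
        table="SRC_TABLE",                          # placeholder source table
        column="ID",                                # PARTITION_COLUMN
        lowerBound=0,                               # LOWER_BOUND
        upperBound=1000000,                         # UPPER_BOUND
        numPartitions=8,                            # NUMBER_PARTITIONS
        properties={"user": "odi", "password": "***"})

Alternatively, PREDICATES corresponds to the predicates argument of the same method: a list of WHERE-clause expressions, one per partition.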

C.10 LKM Spark to SQL

This KM loads data from Spark into a Cassandra table, and also works with other JDBC targets. It can be defined on an AP node that has a Spark source and a SQL target.


Table C-11 LKM Spark to SQL

CREATE_TARG_TABLE
  Create the target table.
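
On the Spark side this corresponds to a JDBC write. A minimal sketch using PySpark's DataFrameWriter.jdbc() with placeholder connection details (the code the KM actually generates may differ):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("odi_spark_to_sql").getOrCreate()
    df = spark.createDataFrame([(1, "a")], ["id", "val"])

    # DataFrameWriter.jdbc() creates the target table when it does not
    # exist, which is what CREATE_TARG_TABLE requests.
    df.write.jdbc(
        url="jdbc:oracle:thin:@//dbhost:1521/svc",  # placeholder JDBC URL
        table="TGT_TABLE",                          # placeholder target table
        mode="append",
        properties={"user": "odi", "password": "***"})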

C.11 RKM Cassandra

RKM Cassandra reverses these metadata elements:
  • Cassandra tables as data stores.

    The Mask field in the Reverse Engineer tab filters reverse-engineered objects based on their names. The Mask field cannot be empty and must contain at least the percent sign (%).

  • Cassandra columns as attributes with their data types.

C.12 XKM Spark Aggregate

Summarize rows, for example, using SUM and GROUP BY.

The following tables describe the options for XKM Spark Aggregate.


Table C-12 XKM Spark Aggregate

CACHE_DATA
  Persist the data with the default storage level.

NUMBER_OF_TASKS
  Number of tasks.



Table C-13 XKM Spark Aggregate for Streaming

WINDOW_AGGREGATION
  Enables window aggregation.

WINDOW_LENGTH
  Window length, as a number of batch intervals.

SLIDING_INTERVAL
  The interval at which the window operation is performed.

STATEFUL_AGGREGATION
  Enables stateful aggregation.

STATE_RETENTION_PERIOD
  Time in seconds to retain a key/value aggregate in the Spark state object.

FORWARD_ONLY_UPDATED_ROWS
  Forward only modified aggregate values to downstream components.

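Window aggregation corresponds to Spark Streaming's windowed reduce operations. A minimal sketch (the source, host, and durations are placeholders; durations are in seconds and must be multiples of the batch interval):

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="odi_window_agg")
    ssc = StreamingContext(sc, batchDuration=10)
    ssc.checkpoint("/tmp/ckpt")  # required for the inverse-function variant

    pairs = (ssc.socketTextStream("host", 9999)  # placeholder source
             .map(lambda line: (line, 1)))

    counts = pairs.reduceByKeyAndWindow(
        lambda a, b: a + b,   # aggregate values entering the window
        lambda a, b: a - b,   # remove values leaving the window
        windowDuration=30,    # WINDOW_LENGTH (3 batches of 10 s)
        slideDuration=20)     # SLIDING_INTERVAL (2 batches of 10 s)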

C.13 XKM Spark Distinct

Eliminates duplicates in data. The functionality is identical to the existing batch processing.

C.14 XKM Spark Expression

Define expressions to be reused across a single mapping.

C.15 XKM Spark Filter

Produce a subset of data by a filter condition.

The following table describes the options for XKM Spark Filter.


Table C-14 XKM Spark Filter

CACHE_DATA
  Persist the data with the default storage level.


C.16 XKM Spark Input Signature and Output Signature

Supports code generation for reusable mappings.

C.17 XKM Spark Join

Joins two or more input sources based on a join condition.

The following table describes the options for XKM Spark Join.


Table C-15 XKM Spark Join

CACHE_DATA
  Persist the data with the default storage level.

NUMBER_OF_TASKS
  Number of tasks.


C.18 XKM Spark Lookup

Look up data for a driving data source.

The following tables describe the options for XKM Spark Lookup.


Table C-16 XKM Spark Lookup

CACHE_DATA
  Persist the data with the default storage level.

NUMBER_OF_TASKS
  Number of tasks.

MAP_SIDE
  Defines whether the KM performs a map-side or a reduce-side lookup, which significantly impacts lookup performance.

KEY_BASED_LOOKUP
  Only data corresponding to the lookup keys is retrieved.
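
A map-side lookup (MAP_SIDE=true) is essentially a broadcast join. A minimal sketch of the pattern with placeholder data:

    from pyspark import SparkContext

    sc = SparkContext(appName="odi_lookup")

    driving = sc.parallelize([(1, "order_a"), (2, "order_b")])
    lookup_data = {1: "customer_x", 2: "customer_y"}  # small lookup set

    # Broadcast the lookup data to all tasks and resolve each key locally,
    # avoiding a shuffle; a reduce-side lookup would use join() instead.
    bc = sc.broadcast(lookup_data)
    enriched = driving.map(lambda kv: (kv[0], kv[1], bc.value.get(kv[0])))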



Table C-17 XKM Spark Lookup for Streaming

MAP_SIDE
  MAP_SIDE=true is suitable for small lookup data sets that fit into memory. This setting provides better performance by broadcasting the lookup data to all Spark tasks.

KEY_BASED_LOOKUP
  For any incoming lookup key, a Spark cache is checked:
  • If the lookup record is present and not expired, the lookup data is served from the cache.
  • If the lookup record is missing or expired, the data is reloaded from the SQL source.

CACHE_RELOAD
  Defines when the lookup source data is loaded and refreshed:
  • NO_RELOAD: The lookup source data is loaded once, on Spark application startup.
  • RELOAD_EVERY_BATCH: The lookup source data is reloaded for every new Spark batch.
  • RELOAD_BASE_ON_TIME: The lookup source data is loaded on Spark application startup and refreshed after the time interval set by the CACHE_RELOAD_INTERVAL option.

CACHE_RELOAD_INTERVAL
  Defines how long data is retained in the Spark cache. After this time, expired records are removed from the cache.


C.19 XKM Spark Pivot

Take data in separate rows, aggregate it, and convert it into columns.

The following table describes the options for XKM Spark Pivot.


Table C-18 XKM Spark Pivot

CACHE_DATA
  Persist the data with the default storage level.


Note:

XKM Spark Pivot does not support streaming.

C.20 XKM Spark Set

Perform UNION, MINUS or other set operations.

C.21 XKM Spark Sort

Sort data using an expression.

The following table describes the options for XKM Spark Sort.


Table C-19 XKM Spark Sort

CACHE_DATA
  Persist the data with the default storage level.

NUMBER_OF_TASKS
  Number of tasks.


C.22 XKM Spark Split

Split data into multiple paths with multiple conditions.

The following table describes the options for XKM Spark Split.


Table C-20 XKM Spark Split

CACHE_DATA
  Persist the data with the default storage level.


C.23 XKM Spark Table Function

Accesses a Spark table function.

The following table describes the options for XKM Spark Table Function.


Table C-21 XKM Spark Table Function

SPARK_SCRIPT_FILE
  Path of the Spark script file, specified by the user.

CACHE_DATA
  Persist the data with the default storage level.


C.24 IKM Spark Table Function

Uses a Spark table function as the target.

The following table describes the options for IKM Spark Table Function.


Table C-22 IKM Spark Table Function

SPARK_SCRIPT_FILE
  Path of the Spark script file, specified by the user.

CACHE_DATA
  Persist the data with the default storage level.


C.25 XKM Spark Unpivot

Transform a single row of attributes into multiple rows in an efficient manner.

The following table describes the options for XKM Spark Unpivot.


Table C-23 XKM Spark Unpivot

CACHE_DATA
  Persist the data with the default storage level.


Note:

XKM Spark Unpivot does not support streaming.