This appendix provides information about the Spark knowledge modules.
This KM loads data from a file into a Spark Python variable. It can be defined on the AP between the execution units, with File as the source technology and Spark Python as the target technology.
The following table describes the options for LKM File to Spark.
Table C-1 LKM File to Spark
Option | Description |
---|---|
Storage Function | The storage function to be used to load/store data. |
CACHE_DATA | Persist the data with the default storage level. |
InputFormatClass | Classname of Hadoop InputFormat. For example, org.apache.hadoop.mapreduce.lib.input.TextInputFormat. |
KeyClass | Fully qualified classname of key Writable class. For example, org.apache.hadoop.io.Text. |
ValueClass | Fully qualified classname of value Writable class. For example, org.apache.hadoop.io.LongWritable. |
KeyConverter | Fully qualified classname of key converter class. |
ValueConverter | Fully qualified classname of value converter class. |
Job Configuration | Hadoop configuration. For example, {'hbase.zookeeper.quorum': 'HOST', 'hbase.mapreduce.inputtable': 'TAB'} |
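As an illustration of the load this KM performs, here is a minimal PySpark sketch. The file path, separator, and class names are hypothetical placeholders, not the code the KM actually generates.

```python
# Minimal sketch of a file-to-Spark load; paths and separator are hypothetical.
from pyspark import SparkContext

sc = SparkContext(appName="FileToSpark")

# Default storage function: textFile, one RDD element per line,
# split on the declared field separator.
rows = sc.textFile("/data/src/customers.csv").map(lambda line: line.split(","))

# Alternative load through a Hadoop InputFormat, mirroring the
# InputFormatClass/KeyClass/ValueClass options above.
kv = sc.newAPIHadoopFile(
    "/data/src/customers.csv",
    "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
    keyClass="org.apache.hadoop.io.LongWritable",
    valueClass="org.apache.hadoop.io.Text")

rows.cache()  # what CACHE_DATA enables: persist with the default storage level
```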
This LKM uses the StreamingContext.textFileStream() method to transfer file content as a data stream. The directory is monitored while the Spark application is running; any files copied from other locations into this directory are detected.
Table C-2 LKM File to Spark for Streaming
Option | Description |
---|---|
STREAMING_MODE | Indicates whether the mapping is executed in streaming mode. Default is FALSE. |
Storage Function | If STREAMING_MODE is set to true, the load function is changed to textFileStream automatically. Default is textFile. |
Source Data store | The source data store is a directory; the field separator must be defined. |
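A minimal sketch of the streaming variant, assuming a hypothetical directory and a 10-second batch interval:

```python
# Streaming load: the directory is monitored while the application runs,
# and files copied into it are picked up as new batches.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="FileToSparkStreaming")
ssc = StreamingContext(sc, 10)  # 10-second micro-batches

lines = ssc.textFileStream("/data/incoming")    # textFileStream instead of textFile
rows = lines.map(lambda line: line.split(","))  # apply the declared field separator
rows.pprint()

ssc.start()
ssc.awaitTermination()
```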
This KM stores data from a Spark Python variable into a file. It can be defined on the AP between the execution units, with Spark Python as the source technology and File as the target technology.
The following table describes the options for LKM Spark to File.
Table C-3 LKM Spark to File
Option | Description |
---|---|
Storage Function | The storage function to be used to load/store data. |
InputFormatClass | Classname of Hadoop InputFormat. For example, org.apache.hadoop.mapreduce.lib.input.TextInputFormat. |
KeyClass | Fully qualified classname of key Writable class. For example, org.apache.hadoop.io.Text. |
ValueClass | Fully qualified classname of value Writable class. For example, org.apache.hadoop.io.LongWritable. |
KeyConverter | Fully qualified classname of key converter class. |
ValueConverter | Fully qualified classname of value converter class. |
Job Configuration | Hadoop configuration. For example, {'hbase.zookeeper.quorum': 'HOST', 'hbase.mapreduce.inputtable': 'TAB'} |
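A minimal sketch of the store direction, with hypothetical data and output path:

```python
# Minimal sketch of a Spark-to-file store (output path is hypothetical).
from pyspark import SparkContext

sc = SparkContext(appName="SparkToFile")
rows = sc.parallelize([("1", "Alice"), ("2", "Bob")])

# Join each row back into delimited text and write one part file per partition.
rows.map(lambda r: ",".join(r)).saveAsTextFile("/data/out/customers")
```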
Table C-4 LKM Spark to File for streaming
Option | Description |
---|---|
Storage Function | If STREAMING_MODE is set to true, the load function is changed to textFileStream automatically. Default is textFile. |
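On the store side of a streaming mapping, the corresponding DStream API for text output is saveAsTextFiles, which writes one output directory per batch interval. A one-line sketch, assuming rows is a DStream of delimited strings:

```python
rows.saveAsTextFiles("/data/out/customers")  # one output directory per batch
```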
This KM loads data from a Hive table into a Spark Python variable. It can be defined on the AP between the execution units, with Hive as the source technology and Spark Python as the target technology.
This KM stores data from a Spark Python variable into a Hive table. It can be defined on the AP between the execution units, with Spark as the source technology and Hive as the target technology.
The following table describes the options for LKM Spark to Hive.
Table C-5 LKM Spark to Hive
Option | Description |
---|---|
CREATE_TARGET_TABLE | Create the target table. |
OVERWRITE_TARGET_TABLE | Overwrite the target table. |
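A minimal PySpark sketch of both directions of the Hive exchange, with hypothetical database and table names:

```python
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="SparkHive")
hc = HiveContext(sc)

# Hive to Spark: read a Hive table into a DataFrame.
src = hc.table("src_db.customers")

# Spark to Hive: store the result back; mode("overwrite") corresponds to
# OVERWRITE_TARGET_TABLE, and saveAsTable creates the table if needed,
# as CREATE_TARGET_TABLE does.
src.write.mode("overwrite").saveAsTable("tgt_db.customers_copy")
```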
This KM loads data from an HDFS file into Spark.
Table C-6 LKM HDFS to Spark
Option | Description |
---|---|
Storage Function | The storage function used to load or store data. |
streamingContext | Name of the streaming context. |
InputFormatClass | Classname of Hadoop InputFormat. For example, org.apache.hadoop.mapreduce.lib.input.TextInputFormat. |
KeyClass | Fully qualified classname of key Writable class. For example, org.apache.hadoop.io.Text. |
ValueClass | Fully qualified classname of value Writable class. For example, org.apache.hadoop.io.LongWritable. |
KeyConverter | Fully qualified classname of key converter class. |
ValueConverter | Fully qualified classname of value converter class. |
Job Configuration | Hadoop configuration. For example, {'hbase.zookeeper.quorum': 'HOST', 'hbase.mapreduce.inputtable': 'TAB'} |
Delete Spark Mapping Files | Delete temporary objects at the end of the mapping. |
Cache | Cache the RDD across operations after computation. |
Storage Level | The storage level used to cache the data. |
Repartition | Repartition the RDD after transformation of this component. |
Level of Parallelism | Number of partitions. |
Sort Partitions | Sort partitions by a key function when you repartition the RDD. |
Partition Sort Order | Sort order of the partitions. |
Partition Key Function | Defines the keys of the partition. |
Partition Function | Customized partitioning function. |
This KM loads data from Spark into an HDFS file.
Table C-7 LKM Spark to HDFS
Option | Description |
---|---|
Storage Function | The storage function used to load or store data. |
OutputFormatClass | Classname of the Hadoop OutputFormat. |
KeyClass | Fully qualified classname of key Writable class. For example, org.apache.hadoop.io.Text. |
ValueClass | Fully qualified classname of value Writable class. For example, org.apache.hadoop.io.LongWritable. |
KeyConverter | Fully qualified classname of key converter class. |
ValueConverter | Fully qualified classname of value converter class. |
Job Configuration | Hadoop configuration. For example, {'hbase.zookeeper.quorum': 'HOST', 'hbase.mapreduce.inputtable': 'TAB'} |
DELETE_TEMPORARY_OBJECTS | Delete temporary objects at the end of the mapping. |
Delete Spark Mapping Files | Delete temporary objects at the end of the mapping. |
Cache | Cache the RDD across operations after computation. |
Storage Level | The storage level used to cache the data. |
Repartition | Repartition the RDD after transformation of this component. |
Level of Parallelism | Number of partitions. |
Sort Partitions | Sort partitions by a key function when you repartition the RDD. |
Partition Sort Order | Sort order of the partitions. |
Partition Key Function | Defines the keys of the partition. |
Partition Function | Customized partitioning function. |
This KM loads data from a Kafka source into a Spark target. It can be defined on an AP node that exists in a Spark execution unit and has a Kafka upstream node.
Table C-8 LKM Kafka to Spark for streaming
Option | Description |
---|---|
Storage Function | The storage function used to load data. Default is createStream. |
Key Decoder | Decodes the message key. Default is empty. |
Value Decoder | Decodes the message value. Default is empty. |
Group Id | Receiver group id parameter used to call KafkaUtils.createStream. Note: in a group of receivers (all receivers having the same Group Id), every message is received by a single receiver only. |
Kafka Params | Parameters used to call KafkaUtils.createStream. |
storageLevel | Storage level used for storing the received objects. Default is StorageLevel.MEMORY_AND_DISK_2. |
Number of Partitions | Number of Kafka partitions that each receiver thread consumes data from. |
From Offsets | Parameter used in conjunction with the createDirectStream function. |
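A minimal sketch of the two load styles these options refer to, assuming the Spark 1.x/2.x Kafka streaming integration (pyspark.streaming.kafka); hosts, topic, and group id are hypothetical:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="KafkaToSpark")
ssc = StreamingContext(sc, 10)

# Receiver-based load (default createStream): Group Id, Kafka Params, and the
# per-topic thread count map onto these arguments.
stream = KafkaUtils.createStream(ssc, "zkhost:2181", "odi-group", {"orders": 1})

# Direct load: From Offsets is only meaningful with createDirectStream.
direct = KafkaUtils.createDirectStream(
    ssc, ["orders"], {"metadata.broker.list": "broker:9092"})
```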
LKM Spark to Kafka works in both streaming and batch mode. It can be defined on the AP between the execution units and requires a Kafka downstream node.
Table C-9 LKM Spark to Kafka
Option | Description |
---|---|
value.serializer | Serializer class for message values. For example, org.apache.kafka.common.serialization.StringSerializer. |
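As an illustration of the producer side, here is a hedged sketch that pushes rows to Kafka from each Spark partition using the kafka-python client as a stand-in; the code the KM generates may differ, and the broker address and topic are hypothetical. The value_serializer argument plays the role of the value.serializer property above:

```python
from kafka import KafkaProducer  # kafka-python client, used here as a stand-in

def send_partition(rows):
    producer = KafkaProducer(
        bootstrap_servers="broker:9092",
        value_serializer=lambda v: v.encode("utf-8"))  # cf. value.serializer
    for row in rows:
        producer.send("orders", row)
    producer.flush()

rdd.foreachPartition(send_partition)  # 'rdd' is assumed to hold delimited strings
```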
This KM is designed to load data from Cassandra into Spark, but it can also work with other JDBC sources. It can be defined on an AP node that has a SQL source and a Spark target.
Table C-10 LKM SQL to Spark
Option | Description |
---|---|
PARTITION_COLUMN | Column used for partitioning. |
LOWER_BOUND | Lower bound of the partition column. |
UPPER_BOUND | Upper bound of the partition column. |
NUMBER_PARTITIONS | Number of partitions. |
PREDICATES | List of predicates. |
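A minimal sketch of the partitioned JDBC read these options drive; the URL, table, and column names are hypothetical:

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="SQLToSpark")
sqlContext = SQLContext(sc)

# Range partitioning: PARTITION_COLUMN, LOWER_BOUND, UPPER_BOUND, and
# NUMBER_PARTITIONS map onto the corresponding jdbc() arguments.
df = sqlContext.read.jdbc(
    url="jdbc:oracle:thin:@//host:1521/svc",
    table="orders",
    column="order_id",
    lowerBound=1,
    upperBound=1000000,
    numPartitions=8)

# Predicate partitioning: one Spark partition per entry in PREDICATES.
df2 = sqlContext.read.jdbc(
    url="jdbc:oracle:thin:@//host:1521/svc",
    table="orders",
    predicates=["region = 'EU'", "region = 'US'"])
```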
This KM loads data from Spark into a Cassandra table and can be defined on an AP node that has a Spark source and a SQL target. It can also work with other JDBC targets.
Table C-11 LKM Spark to SQL
Option | Description |
---|---|
CREATE_TARG_TABLE | Create target table. |
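The store direction is the mirror image; a one-call sketch with hypothetical URL and table, where mode="append" writes into an existing table:

```python
# 'df' is assumed to be the DataFrame produced by the mapping.
df.write.jdbc(
    url="jdbc:oracle:thin:@//host:1521/svc",
    table="orders_copy",
    mode="append")
```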
This RKM reverse-engineers the following metadata elements:
- Cassandra tables as data stores. The Mask field in the Reverse Engineer tab filters reverse-engineered objects based on their names. The Mask field cannot be empty and must contain at least the percent sign (%).
- Cassandra columns as attributes with their data types.
Summarizes rows, for example, using SUM and GROUP BY.
The following table describes the options for XKM Spark Aggregate.
Table C-12 XKM Spark Aggregate
Option | Description |
---|---|
CACHE_DATA | Persist the data with the default storage level. |
NUMBER_OF_TASKS | Task number. |
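A minimal sketch of the SUM ... GROUP BY shape on an RDD of (key, value) pairs; the data is hypothetical:

```python
from pyspark import SparkContext

sc = SparkContext(appName="SparkAggregate")
sales = sc.parallelize([("books", 10.0), ("games", 5.0), ("books", 7.5)])

# GROUP BY key, SUM(value); NUMBER_OF_TASKS corresponds to the optional
# numPartitions argument of reduceByKey.
totals = sales.reduceByKey(lambda a, b: a + b)
print(totals.collect())  # e.g. [('games', 5.0), ('books', 17.5)]
```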
Table C-13 XKM Spark Aggregate for streaming
Option | Description |
---|---|
WINDOW_AGGREGATION | Enables window aggregation. |
WINDOW_LENGTH | Number of batch intervals. |
SLIDING_INTERVAL | The interval at which the window operation is performed. |
STATEFUL_AGGREGATION | Enables stateful aggregation. |
STATE_RETENTION_PERIOD | Time in seconds to retain a key/value aggregate in the Spark state object. |
FORWARD_ONLY_UPDATED_ROWS | Only modified aggregate values are forwarded to downstream components. |
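For the streaming options, a hedged sketch of a windowed sum over a DStream of (key, value) pairs; note that PySpark expresses the window and slide as durations in seconds, whereas WINDOW_LENGTH counts batch intervals:

```python
# 'pairs' is assumed to be a DStream of (key, value) pairs and 'ssc' the
# StreamingContext; using an inverse function requires checkpointing.
ssc.checkpoint("/tmp/checkpoints")
windowed = pairs.reduceByKeyAndWindow(
    lambda a, b: a + b,   # aggregate values entering the window
    lambda a, b: a - b,   # remove values leaving the window
    windowDuration=60,    # cf. WINDOW_LENGTH
    slideDuration=20)     # cf. SLIDING_INTERVAL
```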
Eliminates duplicates in the data; the functionality is identical to that of the existing batch processing.
Produces a subset of data using a filter condition.
The following table describes the options for XKM Spark Filter.
Table C-14 XKM Spark Filter
Option | Description |
---|---|
CACHE_DATA | Persist the data with the default storage level. |
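A one-line sketch of a filter condition; the row layout and predicate are hypothetical:

```python
# 'rows' is assumed to be an RDD of tuples; keep only the EU rows.
filtered = rows.filter(lambda r: r[2] == "EU")
filtered.cache()  # what CACHE_DATA enables
```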
Joins more than one input source based on the join condition.
The following table describes the options for XKM Spark Join.
Table C-15 XKM Spark Join
Option | Description |
---|---|
CACHE_DATA | Persist the data with the default storage level. |
NUMBER_OF_TASKS | Task number. |
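A minimal sketch of a key-based join of two input sources; the data is hypothetical:

```python
from pyspark import SparkContext

sc = SparkContext(appName="SparkJoin")
orders = sc.parallelize([(1, "order-a"), (2, "order-b")])
customers = sc.parallelize([(1, "Alice"), (2, "Bob")])

# Inner join on the key; NUMBER_OF_TASKS corresponds to the optional
# numPartitions argument of join.
joined = orders.join(customers)
# e.g. [(1, ('order-a', 'Alice')), (2, ('order-b', 'Bob'))]
```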
Looks up data for a driving data source.
The following table describes the options for XKM Spark Lookup.
Table C-16 XKM Spark Lookup
Option | Description |
---|---|
CACHE_DATA | Persist the data with the default storage level. |
NUMBER_OF_TASKS | Task number. |
MAP_SIDE | Defines whether the KM performs a map-side or a reduce-side lookup; this choice significantly impacts lookup performance. |
KEY_BASED_LOOKUP | Only data corresponding to the lookup keys is retrieved. |
Table C-17 XKM Spark Lookup for streaming
Option | Description |
---|---|
MAP_SIDE | MAP_SIDE=true: suitable for small lookup data sets that fit into memory. This setting provides better performance by broadcasting the lookup data to all Spark tasks. |
KEY_BASED_LOOKUP | For any incoming lookup key, a Spark cache is checked. |
CACHE_RELOAD | Defines when the lookup source data is loaded and refreshed. |
CACHE_RELOAD_INTERVAL | Defines how long data is retained in the Spark cache. After this time, expired records are removed from the cache. |
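As an illustration of the map-side strategy, here is a hedged sketch: the lookup data is collected and broadcast to every task, so the driving RDD is never shuffled. The data is hypothetical:

```python
from pyspark import SparkContext

sc = SparkContext(appName="SparkLookup")

# Small lookup source, collected to the driver and broadcast (MAP_SIDE=true).
names = sc.broadcast(dict([(1, "Alice"), (2, "Bob")]))

driving = sc.parallelize([(1, "order-a"), (2, "order-b")])
resolved = driving.map(lambda kv: (kv[0], kv[1], names.value.get(kv[0])))
```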
Takes data in separate rows, aggregates it, and converts it into columns.
The following table describes the options for XKM Spark Pivot.
Table C-18 XKM Spark Pivot
Option | Description |
---|---|
CACHE_DATA | Persist the data with the default storage level. |
Note:
XKM Spark Pivot does not support streaming.

Sorts data using an expression.

The following table describes the options for XKM Spark Sort.
Table C-19 XKM Spark Sort
Option | Description |
---|---|
CACHE_DATA | Persist the data with the default storage level. |
NUMBER_OF_TASKS | Task number. |
Splits data into multiple paths with multiple conditions.
The following table describes the options for XKM Spark Split.
Table C-20 XKM Spark Split
Option | Description |
---|---|
CACHE_DATA | Persist the data with the default storage level. |
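A minimal sketch of a split: each output path is a separate filter over the same input, which is why caching the input pays off. The conditions are hypothetical:

```python
# 'rows' is assumed to be an RDD of (id, amount) tuples.
rows.cache()  # CACHE_DATA avoids recomputing 'rows' for every branch
high = rows.filter(lambda r: r[1] >= 100)
low = rows.filter(lambda r: r[1] < 100)
```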
Provides access to a Spark table function.
The following table describes the options for XKM Spark Table Function.
Table C-21 XKM Spark Table Function
Option | Description |
---|---|
SPARK_SCRIPT_FILE | User-specified path of the Spark script file. |
CACHE_DATA | Persist the data with the default storage level. |
Uses a Spark table function as the target.
The following table describes the options for IKM Spark Table Function.
Table C-22 IKM Spark Table Function
Option | Description |
---|---|
SPARK_SCRIPT_FILE | User-specified path of the Spark script file. |
CACHE_DATA | Persist the data with the default storage level. |
Transforms a single row of attributes into multiple rows in an efficient manner.
The following table describes the options for XKM Spark Unpivot.
Table C-23 XKM Spark Unpivot
Option | Description |
---|---|
CACHE_DATA | Persist the data with the default storage level. |
Note:
XKM Spark Unpivot does not support streaming.