C Spark Knowledge Modules

This appendix provides information about the Spark knowledge modules.

This appendix includes the following sections:

C.1 LKM File to Spark
C.2 LKM Spark to File
C.3 LKM Hive to Spark
C.4 LKM Spark to Hive
C.5 XKM Spark Aggregate
C.6 XKM Spark Distinct
C.7 XKM Spark Expression
C.8 XKM Spark Filter
C.9 XKM Spark Flatten
C.10 XKM Spark Join
C.11 XKM Spark Lookup
C.12 XKM Spark Pivot
C.13 XKM Spark Set
C.14 XKM Spark Sort
C.15 XKM Spark Split
C.16 XKM Spark Table Function
C.17 IKM Spark Table Function
C.18 XKM Spark Unpivot

C.1 LKM File to Spark

This KM loads data from a file into a Spark Python variable. It can be defined on the access point (AP) between the execution units, with File as the source technology and Spark Python as the target technology.

The following table describes the options for LKM File to Spark.

Table C-1 LKM File to Spark

Option Description

Storage Function

The storage function to be used to load/store data.

CACHE_DATA

Persist the data with the default storage level.

InputFormatClass

Classname of Hadoop InputFormat.

For example, org.apache.hadoop.mapreduce.lib.input.TextInputFormat.

KeyClass

Fully qualified classname of key Writable class.

For example, org.apache.hadoop.io.Text.

ValueClass

Fully qualified classname of value Writable class.

For example, org.apache.hadoop.io.LongWritable.

KeyConverter

Fully qualified classname of key converter class.

ValueConverter

Fully qualified classname of value converter class.

Job Configuration

Hadoop configuration.

For example, {'hbase.zookeeper.quorum': 'HOST', 'hbase.mapreduce.inputtable': 'TAB'}
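
As an illustration only (not the code that the KM generates), these options map closely to the parameters of PySpark's SparkContext.newAPIHadoopFile() call. The sketch below assumes placeholder values for the file path, application name, and job configuration.

from pyspark import SparkContext

sc = SparkContext(appName="lkm_file_to_spark_example")

# Read a file into an RDD through an explicit Hadoop InputFormat with
# key/value Writable classes and a job configuration (conf).
# For TextInputFormat the key is the byte offset and the value is the line.
source_rdd = sc.newAPIHadoopFile(
    "/user/odi/source_data",   # placeholder path
    inputFormatClass="org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
    keyClass="org.apache.hadoop.io.LongWritable",
    valueClass="org.apache.hadoop.io.Text",
    conf={"textinputformat.record.delimiter": "\n"},   # Job Configuration option
)

# CACHE_DATA=TRUE corresponds to persisting with the default storage level.
source_rdd.cache()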


C.2 LKM Spark to File

This KM stores data into a file from a Spark Python variable. It can be defined on the AP between the execution units, with Spark Python as the source technology and File as the target technology.

The following table describes the options for LKM Spark to File.

Table C-2 LKM Spark to File

Option Description

Storage Function

The storage function to be used to load/store data.

InputFormatClass

Classname of Hadoop InputFormat.

For example, org.apache.hadoop.mapreduce.lib.input.TextInputFormat.

KeyClass

Fully qualified classname of key Writable class.

For example, org.apache.hadoop.io.Text.

ValueClass

Fully qualified classname of value Writable class.

For example, org.apache.hadoop.io.LongWritable.

KeyConverter

Fully qualified classname of key converter class.

ValueConverter

Fully qualified classname of value converter class.

Job Configuration

Hadoop configuration.

For example, {'hbase.zookeeper.quorum': 'HOST', 'hbase.mapreduce.inputtable': 'TAB'}
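
As a hedged illustration of what storing a Spark Python variable into a file can look like (not the code the KM generates), the sketch below uses saveAsTextFile() and saveAsNewAPIHadoopFile(); the paths, sample data, and output format choice are assumptions.

from pyspark import SparkContext

sc = SparkContext(appName="lkm_spark_to_file_example")

# A small key/value RDD standing in for the Spark Python variable to be stored.
result_rdd = sc.parallelize([(1, "a"), (2, "b")])

# Simplest storage function: plain text output.
result_rdd.saveAsTextFile("/user/odi/target_text")   # placeholder path

# Alternatively, write through an explicit Hadoop OutputFormat with
# key/value classes, mirroring the options described above.
result_rdd.saveAsNewAPIHadoopFile(
    "/user/odi/target_seq",   # placeholder path
    outputFormatClass="org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat",
    keyClass="org.apache.hadoop.io.IntWritable",
    valueClass="org.apache.hadoop.io.Text",
)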


C.3 LKM Hive to Spark

This KM loads data from a Hive table into a Spark Python variable. It can be defined on the AP between the execution units, with Hive as the source technology and Spark Python as the target technology.
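
A minimal sketch of reading a Hive table into a Spark Python variable, assuming a Hive-enabled SparkSession and a placeholder table name; it illustrates the idea rather than the exact code the KM generates.

from pyspark.sql import SparkSession

# A Hive-enabled session; in an ODI-generated script the Spark
# context/session is set up by the Spark execution unit.
spark = SparkSession.builder \
    .appName("lkm_hive_to_spark_example") \
    .enableHiveSupport() \
    .getOrCreate()

# Load a Hive table into a Spark variable (a DataFrame, or its RDD).
src_df = spark.table("default.src_customers")   # placeholder Hive table
src_rdd = src_df.rdd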

C.4 LKM Spark to Hive

This KM stores data into a Hive table from a Spark Python variable. It can be defined on the AP between the execution units, with Spark Python as the source technology and Hive as the target technology.

The following table describes the options for LKM Spark to Hive.

Table C-3 LKM Spark to Hive

Option Description

CREATE_TARGET_TABLE

Create the target table.

OVERWRITE_TARGET_TABLE

Overwrite the target table.
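
A minimal sketch of the behavior these options control, assuming a Hive-enabled SparkSession, sample data, and a placeholder target table; mode("overwrite") illustrates OVERWRITE_TARGET_TABLE, and saveAsTable() creates the table if it does not exist yet (CREATE_TARGET_TABLE).

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("lkm_spark_to_hive_example") \
    .enableHiveSupport() \
    .getOrCreate()

# Data standing in for the Spark Python variable produced upstream.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "code"])

# Create the Hive table if needed and replace its contents instead of appending.
df.write.mode("overwrite").saveAsTable("default.tgt_codes")   # placeholder table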


C.5 XKM Spark Aggregate

Summarize rows, for example, using SUM and GROUP BY.

The following table describes the options for XKM Spark Aggregate.

Table C-4 XKM Spark Aggregate

Option Description

CACHE_DATA

Persist the data with the default storage level.

NUMBER_OF_TASKS

The number of tasks to use.
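
A minimal PySpark sketch of a SUM ... GROUP BY over sample data; the numPartitions argument is assumed to play the role of NUMBER_OF_TASKS, and cache() corresponds to CACHE_DATA=TRUE.

from pyspark import SparkContext

sc = SparkContext(appName="xkm_spark_aggregate_example")

# (customer, amount) pairs standing in for the upstream dataset.
rows = sc.parallelize([("c1", 10.0), ("c2", 5.0), ("c1", 2.5)])

# SUM(amount) GROUP BY customer; the second argument sets the number of
# reduce partitions (the NUMBER_OF_TASKS option, as assumed here).
totals = rows.reduceByKey(lambda a, b: a + b, 4)

# CACHE_DATA=TRUE: persist with the default storage level.
totals.cache()
print(totals.collect())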


C.6 XKM Spark Distinct

Eliminates duplicates in data.

The following table describes the options for XKM Spark Distinct.

Table C-5 XKM Spark Distinct

Option Description

CACHE_DATA

Persist the data with the default storage level.
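
A minimal PySpark sketch of duplicate elimination over sample data; cache() stands in for CACHE_DATA=TRUE.

from pyspark import SparkContext

sc = SparkContext(appName="xkm_spark_distinct_example")

rows = sc.parallelize([("c1", 10.0), ("c1", 10.0), ("c2", 5.0)])

# Remove duplicate rows and persist the result with the default storage level.
deduped = rows.distinct()
deduped.cache()
print(deduped.collect())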


C.7 XKM Spark Expression

Define expressions to be reused across a single mapping.
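
For illustration, a sketch of defining an expression once in PySpark and reusing it in several steps of the same mapping; the data and the expression itself are made-up assumptions.

from pyspark import SparkContext

sc = SparkContext(appName="xkm_spark_expression_example")

rows = sc.parallelize([("c1", 10.0), ("c2", 5.0)])

# A single expression defined once and reused in several downstream steps.
def net_amount(row):
    customer, amount = row
    return (customer, amount * 0.9)   # illustrative expression

discounted = rows.map(net_amount)
large_only = rows.map(net_amount).filter(lambda r: r[1] > 5.0)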

C.8 XKM Spark Filter

Produce a subset of data by a filter condition.

The following table describes the options for XKM Spark Filter.

Table C-6 XKM Spark Filter

Option Description

CACHE_DATA

Persist the data with the default storage level.
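
A minimal PySpark sketch of filtering sample data; cache() stands in for CACHE_DATA=TRUE.

from pyspark import SparkContext

sc = SparkContext(appName="xkm_spark_filter_example")

rows = sc.parallelize([("c1", 10.0), ("c2", 5.0), ("c3", 7.5)])

# Keep only the rows matching the filter condition.
filtered = rows.filter(lambda r: r[1] >= 7.0)
filtered.cache()
print(filtered.collect())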


C.9 XKM Spark Flatten

Un-nests complex (nested) data according to the given options.

The following table describes the options for XKM Spark Flatten.

Table C-7 XKM Spark Flatten

Option Description

Default Expression

Default expression for null nested table objects. For example, rating_table(obj_rating('-1', 'Unknown')).

This is used to return a row with default values for each null nested table object.

CACHE_DATA

When set to TRUE, persists the results with the Spark default storage level.

The default is FALSE.
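
A minimal PySpark sketch of un-nesting with a default row substituted for null nested objects (the role of the Default Expression option); the sample data and default values are assumptions.

from pyspark import SparkContext

sc = SparkContext(appName="xkm_spark_flatten_example")

# Each row carries a nested list of (rating, label) pairs; one row has None.
rows = sc.parallelize([
    ("movie1", [("5", "Great"), ("3", "OK")]),
    ("movie2", None),
])

DEFAULT_RATING = [("-1", "Unknown")]   # plays the role of the Default Expression

# Un-nest: one output row per nested element, falling back to the default
# row when the nested table object is null.
flattened = rows.flatMap(
    lambda r: [(r[0], rating, label) for rating, label in (r[1] or DEFAULT_RATING)]
)
print(flattened.collect())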


C.10 XKM Spark Join

Joins two or more input sources based on the join condition.

The following table describes the options for XKM Spark Join.

Table C-8 XKM Spark Join

Option Description

CACHE_DATA

Persist the data with the default storage level.

NUMBER_OF_TASKS

The number of tasks to use.
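
A minimal PySpark sketch of a key-based join over sample data; the numPartitions argument is assumed to correspond to NUMBER_OF_TASKS and cache() to CACHE_DATA=TRUE.

from pyspark import SparkContext

sc = SparkContext(appName="xkm_spark_join_example")

orders = sc.parallelize([("c1", 10.0), ("c2", 5.0)])
customers = sc.parallelize([("c1", "Alice"), ("c2", "Bob")])

# Join on the key; the second argument sets the number of partitions.
joined = orders.join(customers, 4)
joined.cache()
print(joined.collect())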


C.11 XKM Spark Lookup

Lookup data for a driving data source.

The following table describes the options for XKM Spark Lookup.

Table C-9 XKM Spark Lookup

Option Description

CACHE_DATA

Persist the data with the default storage level.

NUMBER_OF_TASKS

The number of tasks to use.
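
A minimal PySpark sketch of a lookup implemented as a left outer join from the driving source to the lookup source; the sample data is made up, and the numPartitions argument is assumed to correspond to NUMBER_OF_TASKS.

from pyspark import SparkContext

sc = SparkContext(appName="xkm_spark_lookup_example")

driving = sc.parallelize([("c1", 10.0), ("c3", 7.5)])
lookup = sc.parallelize([("c1", "Alice"), ("c2", "Bob")])

# Look up the name for every driving row; unmatched keys yield None.
enriched = driving.leftOuterJoin(lookup, 4)
enriched.cache()   # CACHE_DATA=TRUE
print(enriched.collect())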


C.12 XKM Spark Pivot

Takes data in separate rows, aggregates it, and converts it into columns.

The following table describes the options for XKM Spark Pivot.

Table C-10 XKM Spark Pivot

Option Description

CACHE_DATA

Persist the data with the default storage level.
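
A minimal sketch using the DataFrame pivot API over sample data; the column names and pivot values are assumptions.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("xkm_spark_pivot_example").getOrCreate()

sales = spark.createDataFrame(
    [("c1", "Q1", 10.0), ("c1", "Q2", 5.0), ("c2", "Q1", 7.5)],
    ["customer", "quarter", "amount"],
)

# Rows per (customer, quarter) become one row per customer with one
# column per quarter, aggregating the amounts.
pivoted = sales.groupBy("customer").pivot("quarter", ["Q1", "Q2"]).agg(F.sum("amount"))
pivoted.show()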


C.13 XKM Spark Set

Perform UNION, MINUS or other set operations.
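
A minimal PySpark sketch of the common set operations over sample data:

from pyspark import SparkContext

sc = SparkContext(appName="xkm_spark_set_example")

a = sc.parallelize([1, 2, 3, 4])
b = sc.parallelize([3, 4, 5])

union_ab = a.union(b).distinct()   # UNION (distinct)
minus_ab = a.subtract(b)           # MINUS
intersect_ab = a.intersection(b)   # INTERSECT
print(union_ab.collect(), minus_ab.collect(), intersect_ab.collect())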

C.14 XKM Spark Sort

Sort data using an expression.

The following table describes the options for XKM Spark Sort.

Table C-11 XKM Spark Sort

Option Description

CACHE_DATA

Persist the data with the default storage level.

NUMBER_OF_TASKS

The number of tasks to use.
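
A minimal PySpark sketch of sorting by an expression over sample data; numPartitions is assumed to correspond to NUMBER_OF_TASKS and cache() to CACHE_DATA=TRUE.

from pyspark import SparkContext

sc = SparkContext(appName="xkm_spark_sort_example")

rows = sc.parallelize([("c2", 5.0), ("c1", 10.0), ("c3", 7.5)])

# Sort by an expression over each row (here the amount, descending).
sorted_rows = rows.sortBy(lambda r: r[1], ascending=False, numPartitions=4)
sorted_rows.cache()
print(sorted_rows.collect())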


C.15 XKM Spark Split

Split data into multiple paths with multiple conditions.

The following table describes the options for XKM Spark Split.

Table C-12 XKM Spark Split

Option Description

CACHE_DATA

Persist the data with the default storage level.
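
A minimal PySpark sketch of splitting one input into several paths, each driven by its own condition; the sample data and thresholds are assumptions.

from pyspark import SparkContext

sc = SparkContext(appName="xkm_spark_split_example")

rows = sc.parallelize([("c1", 10.0), ("c2", 5.0), ("c3", 7.5)])

# Caching the shared input (CACHE_DATA=TRUE) avoids recomputing it per path.
rows.cache()

# Each output path is driven by its own condition over the same input.
high = rows.filter(lambda r: r[1] >= 8.0)
medium = rows.filter(lambda r: 6.0 <= r[1] < 8.0)
low = rows.filter(lambda r: r[1] < 6.0)
print(high.collect(), medium.collect(), low.collect())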


C.16 XKM Spark Table Function

Spark table function access.

The following table describes the options for XKM Spark Table Function.

Table C-13 XKM Spark Table Function

Option Description

SPARK_SCRIPT_FILE

The user-specified path of the Spark script file.

CACHE_DATA

Persist the data with the default storage level.
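
A heavily hedged sketch of the idea: SPARK_SCRIPT_FILE points to a user-provided Python script, shown here inline as a hypothetical function that takes an RDD and returns a transformed RDD; the function name and contract are assumptions for illustration, not the KM's actual interface.

from pyspark import SparkContext

sc = SparkContext(appName="xkm_spark_table_function_example")

rows = sc.parallelize([("c1", 10.0), ("c2", 5.0)])

# Hypothetical user logic of the kind a Spark script file could contain:
# a function that accepts an RDD and returns a transformed RDD.
def user_table_function(rdd):
    return rdd.mapValues(lambda amount: amount * 2)

transformed = user_table_function(rows)
transformed.cache()   # CACHE_DATA=TRUE
print(transformed.collect())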


C.17 IKM Spark Table Function

Spark table function as target.

The following table describes the options for IKM Spark Table Function.

Table C-14 IKM Spark Table Function

Option Description

SPARK_SCRIPT_FILE

The user-specified path of the Spark script file.

CACHE_DATA

Persist the data with the default storage level.


C.18 XKM Spark Unpivot

Transform a single row of attributes into multiple rows in an efficient manner.

The following table describes the options for XKM Spark Unpivot.

Table C-15 XKM Spark Unpivot

Option Description

CACHE_DATA

Persist the data with the default storage level.
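
A minimal sketch of unpivoting with Spark SQL's stack() function over sample data; the column names and values are assumptions.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("xkm_spark_unpivot_example").getOrCreate()

pivoted = spark.createDataFrame(
    [("c1", 10.0, 5.0), ("c2", 7.5, 3.0)],
    ["customer", "Q1", "Q2"],
)

# One input row with Q1/Q2 columns becomes one output row per quarter.
unpivoted = pivoted.selectExpr(
    "customer",
    "stack(2, 'Q1', Q1, 'Q2', Q2) as (quarter, amount)",
)
unpivoted.show()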