10 Configuring ETL

This chapter provides information on how to extract, transform, and load (ETL) information in FCC Studio. As part of running ETL, you can configure new data source for a graph. Also you can perform additional configuration for a graph. For information on performing graph configurations, see Appendix - Performing Graph Configurations.

Performing Data Source Configuration

To configure a new data source for a graph, follow these steps:

1. Navigate to the fcc_studio_etl_queries table in the Studio Schema.

The FCDM related nodes and edges are available in the table.

2. If you want to add additional nodes or edges, you can add a new entry in the fcc_studio_etl_queries table.

3. Enter the following details in the fcc_studio_etl_queries table to add a new node or edge:

Table 1: fcc_studio_etl_queries Table Details
Column Name	Description	Applicable For
Type	Indicates the column name. Enter the value as NODE or EDGE.	Applicable for node and edge queries.
DF_NAME	Indicates the name for the node or edge.	Applicable for node and edge queries.
SOURCE	Indicates the source of the data. For example: FCDM or ICIJ	Applicable for node and edge queries.
DATAFRAME	Indicates the properties of the node or edge. NOTE: Enter this value only if the data source is Hive and not .a .csv file.	Applicable for node and edge queries.
QUERY	· If the source is Hive, provide the Hive query. · If the source is a .csv file, provide the query in the following format: spark.read.format("csv").option("header", "true").option("mode", "DROPMALFORMED").load("(##FILEPATH##)").select("node_1","node_2","rel_type","SourceID").withColumn("Label",lit("address of")).withColumnRenamed("node_1","from").withColumnRenamed("node_2","to").withColumnRenamed("rel_type","EDGE_TYPE").withColumnRenamed("SourceID","Source").filter(col("EDGE_TYPE")==="registered_address").withColumn("node_ID",concat(lit("#NUMBER#"),col("node_ID"))) For information on the Spark query parameters, see Appendix - Spark Query Parameters Appendix - Spark Query Parameters . NOTE: Ensure that the source .csv file is UTF-8 compatible.	Applicable for node and edge queries.
KEY_COLUMN_NAME	Set the value to the column name of your unique identifier, if the query is for node. For example: 'node_id'.	Applicable for node query.
SOURCE_NODE	Provide the DF_NAME of the node from which the edge starts from.	Applicable for edge query.
DESTINATION_NODE	Provide the DF_NAME of the node from which the edge ends to.	Applicable for edge query.
SOURCE_KEY_COLUMN_NAME	Set the value to the column name which has key_column values of the Source Node. For example: 'from_id'	Applicable for edge query.
DESTINATION_KEY_COLUMN_NAME	Set the value to the column name which has key_column values of the Destination Node. For example: 'to_id'	Applicable for edge query.
ACTIVE	Expected values: 'Y'/'N'. Set the value to Y to consider ETL and Graph loading.	Applicable for node and edge queries.

4. If the source is a .csv file, configure the file path in the fcc_studio_etl_files table.

NOTE	Ensure that the source .csv file is UTF-8 compatible.

5. Enter the following details in the fcc_studio_etl_files table to add file path:

Table 2: fcc_studio_etl_files Table Details
Column Name	Description
DF_NAME	Indicates the name of the node or edge.
DF_SEQ_NO	Indicates the unique sequence ID for each file.
FILEPATH	Enter the path where the .csv files are stored. NOTE: If one data frame has multiple .csv files, then make separate entries for all the files. For example: see Figure fcc_studio_etl_files Table .
FILEORDER	If data must be imported from multiple files, specify the order in which the files must be read.

Figure 1: fcc_studio_etl_files Table