9 Execute ETL

Use this chapter to prepare and perform the batches that execute the ETL process. At the end of batch execution, you can verify the status of all tasks: both the overall status of the batch and the status of each individual task.

Topics:

·        Prepare the Batches 

·        Perform the Batches 

·        Verify Batch Execution

Prepare the Batches

Use this section to prepare batches to execute the ETL. Batches enable you to load graphs, run notebooks, and move data from Oracle Database or Big Data to FCC Studio. Batches are prepared based on the Realms you are using.

·        Prepare Batches for FCCRealm

·        Prepare Batches for FCCSAML Realm

Prepare Batches for FCCRealm

To prepare the batches for FCCRealm, follow these steps:

1.       Copy all the jars from the <Studio_Installation_Path>/ficdb/lib directory to the <FIC_HOME of OFSAA_Installed_Path>/ficdb/lib directory.

2.      Copy the NBExecutor.txt file from the <Studio_Installation_Path>/ficdb/bin directory to the <FIC_HOME of OFSAA_Installed_Path>/ficdb/bin directory.

3.      Navigate to the <Studio_Installation_Path>/ficdb/bin directory.

4.     Run the FCCM_Studio_Set_UserPass.sh command as follows:

§        FCCM_Studio_Set_UserPass.sh --username "Username" --password "Password"

or

§        FCCM_Studio_Set_UserPass.sh -u "USERNAME" -p "PASSWORD"

The FCC_Studio_SecretKey.properties and NBExecutor.txt files are created in the <Studio_Installation_Path>/ficdb/conf directory.

NOTE

1.       Ensure that the FCC_Studio_SecretKey.properties and NBExecutor.txt files are present in the <Studio_Installation_Path>/ficdb/conf directory before executing a notebook batch.

2.      If only the NBExecutor.txt file is present in the <Studio_Installation_Path>/ficdb/conf directory, re-execute the FCCM_Studio_Set_UserPass.sh command with the username and password to create a new FCC_Studio_SecretKey.properties file and update the NBExecutor.txt file.
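
The pre-flight check described in the note can be sketched as a small helper. This is an illustrative sketch only, not part of FCC Studio; pass it whatever your <Studio_Installation_Path>/ficdb/conf directory resolves to.

```python
from pathlib import Path

# Files that must exist before a notebook batch is executed.
REQUIRED = ("FCC_Studio_SecretKey.properties", "NBExecutor.txt")

def missing_conf_files(conf_dir):
    """Return the required files that are absent from the ficdb/conf directory."""
    conf = Path(conf_dir)
    return [name for name in REQUIRED if not (conf / name).exists()]

# If only NBExecutor.txt is present, this returns
# ["FCC_Studio_SecretKey.properties"], which signals that
# FCCM_Studio_Set_UserPass.sh must be re-executed.
```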

Prepare Batches for FCCSAML Realm

To prepare the batches for the FCCSAML realm, you must generate an API token and configure the NBExecutor.properties file.

To generate the API token, follow these steps:

1.       Navigate to the <Studio_Installation_Path>/ficdb/bin directory.

2.      Run the FCCM_Studio_Generate_APIToken.sh script with the application name as the argument:

./FCCM_Studio_Generate_APIToken.sh <APPNAME>

For example: ./FCCM_Studio_Generate_APIToken.sh BATCH_USER

In the NBExecutor.properties file, specify the following details:

·        saml=true

·        username=<BATCH_USERNAME>

·        password=<BATCHUSER_PASSWORD>

·        apiToken=<API_TOKEN>

NOTE

BATCH_USERNAME and BATCHUSER_PASSWORD can be NULL.

 

Perform the Batches

Batches are performed to execute the ETL process. In OFSAAI, the batches contain the Sqoop Job, Connector Job, Graph Job, and Similarity Edge Generation Job. You can also execute the ETL process by running the scripts directly, without configuring the batches.

You can perform batches in the following ways:

·        Perform batches using OFSAAI: For more information, see Create and Execute Run Executable.

·        Perform batches using shell script: For more information, see relevant jobs in Run ETL.

Run ETL

To run the ETL, you must perform these jobs:

·        Sqoop Job

·        Connector Job

·        Graph Job

·        Similarity Edge Generation Job

NOTE

Do not trigger the same ETL job again until the previous run has completed.
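
One simple way a scheduler can enforce this rule is an atomic per-job lock file. The sketch below is illustrative only and is not part of FCC Studio; the lock path is an arbitrary choice.

```python
import os

def acquire_job_lock(lock_path):
    """Atomically create a lock file; return False if the job is already running."""
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        os.close(fd)
        return True
    except FileExistsError:
        return False

def release_job_lock(lock_path):
    """Remove the lock file once the job has completed."""
    os.remove(lock_path)
```

A wrapper script would call acquire_job_lock before triggering an ETL job and release_job_lock only after the job reports completion.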

Sqoop Job

Sqoop is a tool designed for efficiently transferring bulk data between Hadoop and structured datastores such as relational databases. A Sqoop job creates and saves the import and export commands, along with the parameters used to identify and recall the saved job. Recalling (re-executing) a saved job is used for incremental imports, which import only the updated rows from an RDBMS table into HDFS.

NOTE

·        This section is applicable only when FCC Studio is used without OFSAA.

·        Before performing a Sqoop job, verify that schema creation is complete (schema creation replicates the table structure of the FCDM tables from the atomic schema into the Hive schema).

 

The Sqoop Job moves data from the BD/ECM atomic tables to the Hive tables based on the date range. You can skip this task if FCDM data is not required in the graph.

To execute the Sqoop job, follow these steps:

1.       Navigate to the FIC_DB_HOME/bin directory.

2.      If this is your first Sqoop job, execute the following command.  

./FCCM_Studio_SchemaCreation.sh HIVE

3.      The Sqoop job can be scheduled or executed using the following command:

NOTE

This example applies to shell script execution.

 

./FCCM_Studio_ETL_SqoopJob.sh <FROM_FIC_MIS_DATE> <TO_FIC_MIS_DATE> SNAPSHOT_DT=<SNAPSHOT_DATE> <Batch_ID>

For example:

./FCCM_Studio_ETL_SqoopJob.sh "20151201" "20200412" "SNAPSHOT_DT=20200415" "BatchID_001"

Where:

§        FROM_FIC_MIS_DATE is 20151201

§        TO_FIC_MIS_DATE is 20200412

§        SNAPSHOT_DT is 20200415

§        Batch_ID is BatchID_001

NOTE

The date format is “YYYYMMDD”.

 

If the date parameters are passed as null, their values are derived from ETL_PROCESSING_RANGE as follows:

§        SNAPSHOT_DT is the current date.

§        TO_FIC_MIS_DATE is yesterday’s date.

§        FROM_FIC_MIS_DATE is the date that is ETL_PROCESSING_RANGE behind TO_FIC_MIS_DATE.

For example:

./FCCM_Studio_ETL_SqoopJob.sh "null" "null" "SNAPSHOT_DT=null" "BatchID_001"

If the ETL processing range is 2Y, 3M, 10D (2 years, 3 months, 10 days) and the present date is 20200815, then:

§        SNAPSHOT_DT is 20200815

§        TO_FIC_MIS_DATE is 20200814

§        FROM_FIC_MIS_DATE is 20180504
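
The derivation above can be sketched in a few lines. This is an illustrative reconstruction of the documented rule, not the product code, and the month arithmetic is a simplification (it assumes the resulting day of month exists).

```python
from datetime import date, timedelta

def resolve_dates(present, years, months, days):
    """Resolve null date parameters from an ETL_PROCESSING_RANGE of
    <years>Y, <months>M, <days>D, per the documented rule."""
    snapshot_dt = present                          # current date
    to_fic_mis_date = present - timedelta(days=1)  # yesterday's date
    # Step back the years and months on the calendar, then the days.
    y = to_fic_mis_date.year - years
    m = to_fic_mis_date.month - months
    while m < 1:
        m += 12
        y -= 1
    from_fic_mis_date = to_fic_mis_date.replace(year=y, month=m) - timedelta(days=days)
    return snapshot_dt, to_fic_mis_date, from_fic_mis_date

# 2Y, 3M, 10D with present date 20200815:
snap, to_d, from_d = resolve_dates(date(2020, 8, 15), 2, 3, 10)
print(snap, to_d, from_d)  # 2020-08-15 2020-08-14 2018-05-04
```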

Connector Job

The Connector job transforms the data from the Hive tables or the .csv files (depending on the data source) into node and edge format, and identifies the changes in the data for the graph.

To execute the connector job, follow these steps:

1.       Navigate to the FIC_DB_HOME/bin directory.

2.      Execute the following command:

./FCCM_Studio_ETL_Connector.sh <Source> SNAPSHOT_DT=<SNAPSHOT_DATE>

NOTE

The date format is “YYYYMMDD”.

 

For example,

./FCCM_Studio_ETL_Connector.sh FCDM SNAPSHOT_DT=20200415

Where:

§        Source: FCDM

§        Snapshot DT: 20200415

FCC Studio has a ready-to-use configuration for the FCDM and ICIJ data sources as per the graph model. If the date parameter is passed as null, the snapshot date is taken from the previous Sqoop job if one is present; otherwise, it is the present day.

For example:

./FCCM_Studio_ETL_Connector.sh FCDM SNAPSHOT_DT=null

The snapshot date is 20200815 (see the example in the Sqoop Job section).

·        To run the ready-to-use configuration for FCDM, run the following command:

./FCCM_Studio_ETL_Connector.sh FCDM SNAPSHOT_DT=20200415

·        To run the ready-to-use configuration for ICIJ, run the following command:

./FCCM_Studio_ETL_Connector.sh ICIJ SNAPSHOT_DT=20200415

NOTE

When the Connector snapshot date is null, the snapshot date of the last run Sqoop job is used.
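
The fallback described above can be sketched as follows. This is an illustrative sketch of the documented rule, not the product code; the function name and parameters are hypothetical.

```python
from datetime import date

def resolve_connector_snapshot(snapshot_arg, last_sqoop_snapshot=None):
    """Resolve SNAPSHOT_DT for the Connector job: null falls back to the
    snapshot date of the last run Sqoop job if present, otherwise to the
    present day (YYYYMMDD)."""
    if snapshot_arg and snapshot_arg.lower() != "null":
        return snapshot_arg
    if last_sqoop_snapshot:
        return last_sqoop_snapshot
    return date.today().strftime("%Y%m%d")
```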

 

Graph Job

The Graph Job generates the JSON files that the PGX server loads, together with the .csv files for all the sources, and pushes the changes to the PGX server.

To execute the Graph job, follow these steps:

1.       Navigate to the FIC_DB_HOME/bin directory.

2.      Execute the following command:

./FCCM_Studio_ETL_Graph.sh

After the first execution of this task, start the PGX server to load the graph, which can be queried and viewed in the FCC Studio Notebook.

Similarity Edge Generation Job

This task generates the similarity edges based on the ruleset. The similarity edges above the manual threshold are also added to the Graph.

To execute the Similarity Edge Generation job, follow these steps:

1.       Navigate to the FIC_DB_HOME/bin directory.

2.      Execute the following command:

./FCCM_Studio_ETL_BulkSimilarityEdgeGeneration.sh

 

NOTE

This job must be triggered whenever you restart the PGX server.

 

Verify Batch Execution

Use this section to verify the status of all tasks at the end of batch execution. You can verify both the overall status of the batch and individual task status.

Topics:

·        Verify Sqoop Job

·        Verify Connector Job

·        Verify Graph Job

·        Verify Similarity Edge Generation Job

·        Verify Oracle Schema Tables

Verify Sqoop Job

Use this section to verify the logs and Hive tables for the Sqoop job.

Verify Logs

To verify logs in Sqoop Job, follow these steps:

1.       Navigate to the <Studio Home>/logs/ directory.

2.      Open the batchservice.log file. The overall status and individual status of each moved table are displayed. Also, errors are displayed if any individual table has failed.

Based on this information, fix the errors accordingly, or contact My Oracle Support.

Verify Hive Tables

To verify the Hive table in Sqoop Job, follow these steps:

1.       Connect to the Hive schema.

2.      Verify that data was moved into the respective tables (based on the logs) for the snapshot date of the batch.

Verify Connector Job

Use this section to verify logs and Hive tables for the connector job.

Verify Logs

To verify logs in Connector Job, follow these steps:

1.       Navigate to the <Studio Home>/logs/ directory.

2.      Open the batchservice.log file. The overall status and the status of each entity are displayed. Errors are also displayed if any entity has failed.

Based on this information, fix the errors accordingly, or contact My Oracle Support.

Verify Hive Tables

To verify the Hive table in Connector Job, follow these steps:

1.       Connect to the Hive schema.

2.      Verify whether the tables named <Source>_<entity_name> (for example, fcdm_customer) are present and populated, based on the log.

Verify Indices in Elasticsearch

To verify indices in Elasticsearch, follow these steps:

1.       Enter the URL in the following format into the browser:

http://<Elastic_Search_Hostname>:<Elastic_Search_Port>/_cat/indices

All the indices must be displayed with the same snapshot date with which the job was triggered.

2.      Verify that the indices follow the format <Index name>_<Snapshot Date>.

For example:

§        fcdm_customer_2020-03-01 

§        icij_bahama_external_address_2020-03-01
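
The naming pattern can be checked programmatically. The sketch below is illustrative only; the YYYYMMDD-to-YYYY-MM-DD conversion is an assumption inferred from the examples above, not a documented contract.

```python
def expected_index_name(source, entity, snapshot_dt):
    """Build the <Index name>_<Snapshot Date> string from a YYYYMMDD
    snapshot date. The dashed date format is inferred from the examples."""
    dashed = f"{snapshot_dt[:4]}-{snapshot_dt[4:6]}-{snapshot_dt[6:]}"
    return f"{source}_{entity}_{dashed}"

print(expected_index_name("fcdm", "customer", "20200301"))
# fcdm_customer_2020-03-01
```

A verification script could compare such expected names against the output of the _cat/indices endpoint.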

Verify Graph Job

Use this section to verify the logs for the Graph job.

Verify Logs

To verify logs in the Graph Job, follow these steps:

1.       Navigate to the <Studio Home>/logs/ directory.

2.      Open the batchservice.log file. The overall status and the status of each entity are displayed. Errors are also displayed if any entity has failed.

Based on this information, fix the errors accordingly, or contact My Oracle Support.

For example, if the fix requires a query change or configuration changes, follow the cleanup steps and re-run the tasks after applying the fix.

 

Verify Similarity Edge Generation Job

Use this section to verify the logs and Oracle schema tables for the Similarity Edge Generation job.

Verify Logs

To verify logs in the Similarity Edge Generation Job, follow these steps:

1.       Navigate to the <Studio Home>/logs/ directory.

2.      Open the batchservice.log and entity-resolution.log files. The overall status and the status of each ruleset are displayed. Errors are also displayed if any entity has failed.

Based on this information, fix the errors accordingly, or contact My Oracle Support.

Verify Oracle Schema Tables

To verify the Oracle schema tables, follow these steps:

1.       Navigate to the Oracle Studio schema.

2.      Verify if the similarity edges are formed for the following tables:

§        fcc_er_matched_edges

§        fcc_er_matched_edges_manual

 

Clean Up for ETL

If any ETL job fails and you want to re-run it, you must first clean up the ETL.

To clean up the ETL, follow these steps:

1.       Navigate to the FCC Studio schema based on the source (for example, FCDM or ICIJ), and delete the following tables:

§        fcc_studio_graph_entity_provider

§        fcc_studio_graph_plug_edge_status  

2.      Drop the tables created in the Hive schema:

§        If you want to clean up the ICIJ job, then drop ICIJ related tables. For example, icij_paradise_external_entity.

§        If you want to clean up the FCDM job, then drop FCDM related tables. For example, fcdm_customer.

3.      Truncate the tables created after the schema creation job (for example, cust, acct, wire_trxn, fcc_studio_nodeedge_lookup, and so on).

4.     Delete the folder where the graph.json and .csv files are created, that is, "HDFS_GRAPH_FILES_PATH" (see config.sh under the <Studio_Installation_Path>/bin path).