4.4.1 Create Index and Load the Data
Note:
When the MATCHING_MECHANISM parameter is OS, ensure that the Logstash parameter (index.logstashconf.apply) is set to true in the load-to-open-search application.properties file to load data from the database.
The ER_Create_And_Load_Data_Into_Index.sh job performs the following:
- It creates all the output tables required at the different stages of the Entity Resolution tasks: the Index view table, Matching output table, Manual matches output table, Merge Map output table, Manual map merge output table, and final dataset output tables.
- It takes the pipeline ID as an input argument and creates all the tables related to that pipeline ID.
- When processing high-volume data, the index-loading step in this job may take longer for the current FIC_MIS_DATE as well as the next FIC_MIS_DATE execution. In this case, refer to the Compliance Studio log files in the <COMPLIANCE_STUDIO_INSTALLATION_PATH>/deployed/logs directory. The log files are:
  - er-batch.log
  - load-to-es.log (for Elasticsearch) / load-to-open-search.log (for OpenSearch)
  Additionally, you can refer to the ES or OS cluster logs on the system where ES or OS is configured.
- It creates the index for the given dataset and loads the data into the index table based on the values provided in the index.pipeline-id argument.
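When the index-loading step runs long, a quick scan of the log files named above can show whether errors have been logged. The following is a minimal sketch and not part of the product: the function name scan_er_logs, its es/os argument, and the assumption that problem lines contain the text ERROR are all illustrative, so adjust the pattern to the actual log format.

```shell
#!/usr/bin/env bash
# Hypothetical helper: scan the Compliance Studio ER logs for error lines.
# Usage: scan_er_logs <log_dir> <es|os>
scan_er_logs() {
  local log_dir="$1" engine="$2" loader_log f
  case "$engine" in
    es) loader_log="load-to-es.log" ;;            # Elasticsearch loader log
    os) loader_log="load-to-open-search.log" ;;   # OpenSearch loader log
    *)  echo "engine must be es or os" >&2; return 2 ;;
  esac
  for f in "$log_dir/er-batch.log" "$log_dir/$loader_log"; do
    if [ -f "$f" ]; then
      # -H prefixes each match with the file name.
      # Assumption: error lines contain the literal text ERROR.
      grep -H "ERROR" "$f" || echo "$f: no ERROR lines found"
    else
      echo "$f: not found" >&2
    fi
  done
}
```

For an OpenSearch setup, the helper could be called as scan_er_logs "<COMPLIANCE_STUDIO_INSTALLATION_PATH>/deployed/logs" os, with the placeholder replaced by the actual installation path.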
Note:
In systems where the delta is already derived by other techniques or processes, and the data is known to be a "true delta", you can skip the delta computation within ER for faster turnaround in the Create Index and Load the Data job. In such cases, the input from the _PRE tables is treated as the actual delta. This is achieved by setting a batch parameter.
To skip delta computation, set the "deltaComputed" parameter in <job1_script script name> to 'true' (including the single quotes). Any input from the _PRE tables is then assumed to be the delta (modified or new records). Note that deltaComputed is considered only when the Create Index and Load the Data job is executed with the load type DeltaLoad.
The _CHUNKED tables from the previous execution (for example, H$STG_PARTY_MASTER_PRE_101_CHUNKED_1) are not required when executing the Create Index and Load the Data job with deltaComputed set to 'true'. If you plan to always execute the job with deltaComputed set to 'true', you can skip chunk creation during the job by setting the F_CREATE_CHUNKS value to false in the FCC_ER_CONFIG table in the FSDF schema.
Configuration for Create Index and Load the Data
Full View Table (FCC_ER_FULL) Initrans: A high number of parallel processes requires a table to have a higher INITRANS value. The maximum number of parallel processes during a MERGE operation on the FCC_ER_FULL table can be configured using the SINGLETON_TASK_PARALLEL_LEVEL parameter.
To configure the SINGLETON_TASK_PARALLEL_LEVEL parameter, see the Additional Configurations section.
- Update the metadata in the V_MAKE_TABLE_QUERIES column of the FCC_STUDIO_ER_QUERIES table in the Studio schema for the active ER pipeline (for example, CSA_812).
- Retrieve the current metadata with the following query:
Select V_MAKE_TABLE_QUERIES from fcc_studio_er_queries where DF_NAME = '<ACTIVE ER DF_NAME>' and V_PIPELINE_ID = '<ACTIVE ER PIPELINE ID>';
For example:
Select V_MAKE_TABLE_QUERIES from fcc_studio_er_queries where DF_NAME = 'Customer812' and V_PIPELINE_ID = 'CSA_812';
- Search for "N_CUSTOM_INITRANS NUMBER" and set a custom value only if required.
For example, N_CUSTOM_INITRANS NUMBER := 50;
- Commit the changes.
Steps
- Navigate to the <COMPLIANCE_STUDIO_INSTALLATION_PATH>/deployed/ficdb/bin directory.
- Run the following command:
nohup ./ER_Create_And_Load_Data_Into_Index.sh "<PIPELINE_ID>" "<ER_SCHEMA_WALLET_ALIAS>" "<LOAD_TYPE>" "<FIC_MIS_DATE>" "<FSDF_VERSION>" "<BATCH_GROUP>" "<SOURCE_BATCH>" "<DATA_ORIGIN>" "<RUN_TYPE>" &
Note:
- <BATCH_GROUP> refers to the FCC_PROCESSING_GROUP table in the Compliance Studio schema.
- <SOURCE_BATCH> and <DATA_ORIGIN> are not currently used as execution parameters; they are included for future use.
For example, you can use the following command for the CSA_8129 pipeline.
FSDF 8129 version:
nohup ./ER_Create_And_Load_Data_Into_Index.sh "CSA_8129" "ER_SCHEMA_PP_ALIAS" "FullLoad" "20151210" "8129" "CSA_812" "CSA_812" "US" "RUN" &
For more information about parameters, see the Parameters for Entity Resolution Job execution section.
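Because the job takes nine positional arguments, a small pre-flight check can catch a missing or misordered value before the job is launched in the background. The following is a sketch under stated assumptions, not a product utility: the function name validate_er_args is hypothetical, the accepted load types are limited to the two named in this section (FullLoad and DeltaLoad), and FIC_MIS_DATE is assumed to be eight digits as in the example above.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check for ER_Create_And_Load_Data_Into_Index.sh.
# Argument order follows the command shown above:
#   PIPELINE_ID ER_SCHEMA_WALLET_ALIAS LOAD_TYPE FIC_MIS_DATE FSDF_VERSION
#   BATCH_GROUP SOURCE_BATCH DATA_ORIGIN RUN_TYPE
validate_er_args() {
  if [ "$#" -ne 9 ]; then
    echo "expected 9 arguments, got $#" >&2
    return 1
  fi
  local load_type="$3" fic_mis_date="$4"
  # Assumption: only the two load types named in this section are accepted.
  case "$load_type" in
    FullLoad|DeltaLoad) ;;
    *) echo "unexpected LOAD_TYPE: $load_type" >&2; return 1 ;;
  esac
  # Assumption: FIC_MIS_DATE is eight digits (YYYYMMDD), as in the example.
  case "$fic_mis_date" in
    [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]) ;;
    *) echo "FIC_MIS_DATE must be YYYYMMDD: $fic_mis_date" >&2; return 1 ;;
  esac
  return 0
}
```

One way to use it is to call validate_er_args with the same nine values first and run the nohup command only if the check returns success. The check looks only at the argument count, load type, and date shape; it does not verify that the pipeline ID, wallet alias, or batch group exist.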