Create Index and Load the Data
Note:
When the MATCHING_MECHANISM parameter is OS, ensure that the Logstash parameter (index.logstashconf.apply) is set to true in the application.properties file of the load-to-open-search application so that data is loaded from the database.
Job
ER_Create_And_Load_Data_Into_Index.sh performs the following:
- It creates all the output tables required at the different stages of Entity resolution tasks.
- The job takes the pipeline ID as an argument and creates all the tables related to that pipeline ID.
- The tables created are the Index view table, Matching output table, Manual matches output table, Merge Map output table, Manual map merge output table, and final dataset output tables.
- When processing high-volume data, the index-loading step in this job may take longer for the current FIC_MIS_DATE as well as the next FIC_MIS_DATE execution. In this case, refer to the Compliance Studio log files in the <COMPLIANCE_STUDIO_INSTALLATION_PATH>/deployed/logs directory. The log files are:
  - er-batch.log
  - load-to-open-search.log (for OpenSearch)
  Additionally, you can refer to the OS cluster logs in the environment where OS is configured.
- It creates the index for the given Dataset and loads the data into the index table based on values provided in the index.pipeline-id argument.
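When the index-loading step runs long, the two log files named above can be checked for recent problems with a short script. The following is a minimal sketch, not a product utility: scan_er_logs is a hypothetical helper, and the ERROR/Exception search pattern is an assumption about how these logs mark failures.

```shell
#!/usr/bin/env bash
# Sketch: show recent error lines from the ER log files named above.
# scan_er_logs is a hypothetical helper; the ERROR/Exception pattern is an
# assumption about the log format and may need adjusting.
scan_er_logs() {
  local log_dir="$1"
  local f
  for f in er-batch.log load-to-open-search.log; do
    if [ -f "$log_dir/$f" ]; then
      echo "== $f =="
      # Print at most the last 20 matching lines, if there are any.
      grep -E 'ERROR|Exception' "$log_dir/$f" | tail -n 20
    else
      echo "== $f (not found) =="
    fi
  done
}

# Example: scan_er_logs "<COMPLIANCE_STUDIO_INSTALLATION_PATH>/deployed/logs"
```

The helper only reads the files, so it can be run while the job is still executing.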
Note:
In systems where the delta is already derived by other techniques or processes and the data is known to be a "true delta", the delta computation within ER can be skipped for faster turnaround in the Create Index and Load the Data job. In such cases, the input from the _PRE tables is treated as the actual delta. This is achieved by setting a batch parameter.
To skip delta computation, set the "deltaComputed" parameter in <job1_script script name> to 'true' (including the single quotes). Any input from the _PRE tables is then assumed to be delta (modified or new records). Note that deltaComputed is considered only when the Create Index and Load the Data job is executed with the load type DeltaLoad.
The _CHUNKED tables from the previous execution (for example, H$STG_PARTY_MASTER_PRE_101_CHUNKED_1) are not required when the Create Index and Load the Data job is executed with deltaComputed as 'true'. If you plan to always execute the job with deltaComputed as 'true', you can skip chunk creation during the job by setting the F_CREATE_CHUNKS value to false in the FCC_ER_CONFIG table in the FSDF schema.
- Index Reconciliation is implemented within ER Job 1 to ensure that the output of ER Job 1 (FCC_ER_FULL) is in synchronization with the OS indexes. Validation is performed both before and after the successful execution of the job.
The skipIdxReconciliation parameter handles cases where the ER batch is executed without an OS index load. It is optional and needs to be configured only to bypass the index validation. If required, configure it in ER_Create_And_Load_Data_Into_Index.sh. The default value is "false", that is, index validation is always performed. If this parameter is "false" for an execution where index loading is disabled, ER Job 1 validates the OS indexes against FCC_ER_FULL and fails because of the inconsistency.
Follow the steps below to configure this parameter for an ER run without index load:
- For the current run (for example, Day 20) where index loading is to be disabled, configure the following parameters in ER_Create_And_Load_Data_Into_Index.sh before executing the jobs:
  - Set loadToESNeeded=0 to disable index loading.
  - Set skipIdxReconciliation=true to skip index validation after the job executes. The index is still validated before the job executes if the previous run was executed with index loading.
  - Execute ER Jobs 1, 3, and 4.
- For the next FIC_MIS_DATE execution (Day 21):
  - Revert to loadToESNeeded=1 to resume index loading for subsequent runs.
  - Ensure that skipIdxReconciliation=true so that the pre-check index validation is bypassed before the next-day run. This can be set to "false" for subsequent runs.
  - Execute ER Jobs 1 to 4.
- For subsequent ER runs (Day 22 onwards), set skipIdxReconciliation=false.
Note:
When skipIdxReconciliation is set to true without skipping index loading:
- If index loading is not skipped for the current FIC_MIS_DATE and skipIdxReconciliation is set to true, index validation still takes place after job execution.
- If index loading was not skipped for the previous FIC_MIS_DATE and skipIdxReconciliation is set to true in the current run, index validation still takes place before job execution.
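The Day 20, Day 21, and Day 22 settings described above can be summarized in a small lookup. This is only an illustrative sketch: er_index_flags and the phase names are hypothetical, and the script does not modify ER_Create_And_Load_Data_Into_Index.sh; it only prints the values to configure there.

```shell
#!/usr/bin/env bash
# Sketch: print the two settings described above for each phase of a run
# sequence in which one day skips index loading.
# er_index_flags and the phase names are hypothetical, for illustration only.
er_index_flags() {
  case "$1" in
    skip-day)   # for example, Day 20: index loading disabled
      echo "loadToESNeeded=0 skipIdxReconciliation=true" ;;
    next-day)   # for example, Day 21: loading resumes, pre-check still bypassed
      echo "loadToESNeeded=1 skipIdxReconciliation=true" ;;
    normal)     # for example, Day 22 onwards: validation before and after
      echo "loadToESNeeded=1 skipIdxReconciliation=false" ;;
    *)
      echo "usage: er_index_flags skip-day|next-day|normal" >&2
      return 1 ;;
  esac
}

for phase in skip-day next-day normal; do
  printf '%-9s %s\n' "$phase" "$(er_index_flags "$phase")"
done
```

Keeping the three phases in one place makes it harder to forget the Day 21 step, where index loading is back on but the pre-check must still be bypassed.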
Configuration for Create Index and Load the Data
Full View Table (FCC_ER_FULL) Initrans: A high number of parallel processes requires a table to have a higher INITRANS value. The maximum number of parallel processes during a MERGE operation on the FCC_ER_FULL table can be configured using the SINGLETON_TASK_PARALLEL_LEVEL parameter.
To configure the SINGLETON_TASK_PARALLEL_LEVEL parameter, see the Additional Configurations section.
- Update the metadata under V_MAKE_TABLE_QUERIES column in the FCC_STUDIO_ER_QUERIES table in Studio Schema for the active ER pipeline. For example, CSA_813.
- Select V_MAKE_TABLE_QUERIES from fcc_studio_er_queries where DF_NAME = '<ACTIVE ER DF_NAME>' and V_PIPELINE_ID = '<ACTIVE ER PIPELINE ID>';
For example:
Select V_MAKE_TABLE_QUERIES from fcc_studio_er_queries where DF_NAME= 'Customer813' and V_PIPELINE_ID = 'CSA_813';
- Search for "N_CUSTOM_INITRANS NUMBER" and set a custom value only if required.
For example, N_CUSTOM_INITRANS NUMBER := 50;
- Commit the changes.
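If the V_MAKE_TABLE_QUERIES text is edited outside the database before being written back, the N_CUSTOM_INITRANS change can be applied with a small filter. The sketch below rests on assumptions: set_custom_initrans is a hypothetical helper, the declaration is assumed to sit on one line and end with a semicolon, and the 1 to 255 check reflects the range Oracle Database accepts for INITRANS. How the text is exported and updated is not covered here.

```shell
#!/usr/bin/env bash
# Sketch: rewrite the N_CUSTOM_INITRANS declaration in text read from stdin.
# set_custom_initrans is a hypothetical helper; it assumes the declaration
# is on a single line and ends with a semicolon.
set_custom_initrans() {
  local value="$1"
  case "$value" in
    ''|*[!0-9]*) echo "INITRANS must be a whole number" >&2; return 1 ;;
  esac
  # Oracle Database accepts INITRANS values from 1 to 255.
  if [ "$value" -lt 1 ] || [ "$value" -gt 255 ]; then
    echo "INITRANS must be between 1 and 255" >&2
    return 1
  fi
  # Replace whatever follows "N_CUSTOM_INITRANS NUMBER" up to the semicolon.
  sed -E "s/(N_CUSTOM_INITRANS[[:space:]]+NUMBER)[^;]*;/\1 := ${value};/"
}

printf '%s\n' 'N_CUSTOM_INITRANS NUMBER;' | set_custom_initrans 50
```

Lines that do not contain the declaration pass through unchanged, so the filter can be run over the whole exported text.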
Steps
- Navigate to the <COMPLIANCE_STUDIO_INSTALLATION_PATH>/deployed/ficdb/bin directory.
- Run the following command:
nohup ./ER_Create_And_Load_Data_Into_Index.sh "<PIPELINE_ID>" "<ER_SCHEMA_WALLET_ALIAS>" "<LOAD_TYPE>" "<FIC_MIS_DATE>" "<FSDF_VERSION>" "<BATCH_GROUP>" "<SOURCE_BATCH>" "<DATA_ORIGIN>" "<RUN_TYPE>" &
Note:
- <BATCH_GROUP> refers to the FCC_PROCESSING_GROUP table in the Compliance Studio schema.
- <SOURCE_BATCH> and <DATA_ORIGIN> are not currently used as execution parameters; they are reserved for future use.
For example, you can use the following command for the CSA_813 pipeline.
FSDF 813 version:
nohup ./ER_Create_And_Load_Data_Into_Index.sh "CSA_813" "ER_SCHEMA_PP_ALIAS" "FullLoad" "20151210" "813" "CSA_812" "CSA_812" "US" "RUN" &
For more information about parameters, see the Parameters for Entity Resolution Job execution section.
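Because the command takes nine positional arguments, a small pre-check can catch a missing or misplaced value before the job is launched. The following is a sketch under stated assumptions: build_er_job1_cmd is a hypothetical helper, LOAD_TYPE is assumed to be either FullLoad or DeltaLoad, and FIC_MIS_DATE is assumed to be eight digits (YYYYMMDD), as in the example above. It prints the command rather than running it.

```shell
#!/usr/bin/env bash
# Sketch: validate the arguments, then print the Job 1 command line.
# build_er_job1_cmd is a hypothetical helper. The checks assume LOAD_TYPE is
# FullLoad or DeltaLoad and FIC_MIS_DATE is eight digits (YYYYMMDD); both
# are inferred from the examples in this section.
build_er_job1_cmd() {
  if [ "$#" -ne 9 ]; then
    echo "expected 9 arguments, got $#" >&2
    return 1
  fi
  local load_type="$3" fic_mis_date="$4"
  case "$load_type" in
    FullLoad|DeltaLoad) ;;
    *) echo "unexpected LOAD_TYPE: $load_type" >&2; return 1 ;;
  esac
  case "$fic_mis_date" in
    [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]) ;;
    *) echo "FIC_MIS_DATE must be YYYYMMDD: $fic_mis_date" >&2; return 1 ;;
  esac
  # Quote every argument, as in the documented command.
  printf './ER_Create_And_Load_Data_Into_Index.sh'
  printf ' "%s"' "$@"
  printf '\n'
}

build_er_job1_cmd "CSA_813" "ER_SCHEMA_PP_ALIAS" "FullLoad" "20151210" \
  "813" "CSA_812" "CSA_812" "US" "RUN"
```

After reviewing the printed line, it can be run from the ficdb/bin directory with nohup and a trailing & as shown in the steps above.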