Understanding Data Validation and Error Handling in the ETL Process

Accurate reporting is completely dependent on the data stored in data warehouse OWE and MDW tables; if incomplete or incorrect data resides in these tables, reporting and analysis can be flawed. Given the considerable dependence on data in EPM tables, all source data entering EPM must be validated.

Data validations are performed when you run ETL jobs. Because we want to ensure that complete, accurate data resides in the OWE and MDW tables, data validations are embedded in the jobs that load data from the OWS to the OWE and MDW. Therefore, data that passes the validation process is loaded into OWE and MDW target tables, while data that fails the validation process is redirected to separate error tables in the OWS. This ensures that flawed data never finds its way into the target OWE and MDW tables.

Error tables log the source values failing validation to aid correction of the data in the source system. There is an error table for each OWS driver table. OWS driver tables are those tables that contain the primary information for the target entity (for example customer ID).

Data Completeness Validation and Job Statistic Summary for Campus Solutions, FMS, and HCM Warehouses

A separate data completeness validation and job statistic capture is performed against the data being loaded into Campus Solutions, FMS, and HCM MDW tables (for example, validating that all records, fields, and content of each field is loaded, determining source row count versus target insert row count, and so forth). The validation and job statistic tracking is also performed in ETL jobs. The data is output to the PS_DAT_VAL_SMRY_TBL and PS_DATVAL_CTRL_TBL tables with prepackaged Oracle Business Intelligence (OBIEE) reports built on top of the tables.

See PeopleSoft EPM: Fusion Campus Solutions Intelligence for PeopleSoft.

Understanding the Data Validation Mechanism

The following graphic represents the data validation-error handling process in the PeopleSoft delivered J_DIM_PS_D_DET_BUDGET job:

Image: Data validation in the J_DIM_PS_D_DET_BUDGET job

This example illustrates the Data validation in the J_DIM_PS_D_DET_BUDGET job.

Note that two hashed file validations are performed on the source data: the HASH_PS_PF_SETID_LOOKUP (which validates SetID) and HASH_PS_D_DT_PATTERN (which validates pattern code). Any data failing validation of these lookups is sent to the OWS error table (DRS_PS_ES_CAL_BP_TBL) via the Load_Error_PF_SETID_LOOKUP and Load_Error_D_DT_PATTERN_LOOKUP.

A closer look at the stage variables in the Trans_Gen_Key transformer stage demonstrate how the data validation process works (open the stage, then click the Show/Hide Stage Variables button on Transformer toolbar):

Image: Trans_Gen_Key transformer stage variables

This example illustrates the Trans_Gen_Key transformer stage variables. .

Note that the ErrorFoundSetID and ErrorFoundDDTPATTERN stage variable derivations are set to Y if the SETID lookup or pattern code validations fail.

Image: Stage variable properties

This example illustrates the Stage variable properties.

The value of the ErrorFound stage variable, however, depends on the values of the ErrorFoundSetID and ErrorFoundDDTPATTERN stage variables, as well as the value of the $ERR_VALIDATE parameter, which can be configured to Y or N. If the $ERR_VALIDATE parameter is set to Y, rows that fail validation are written to the error table. If the value is set to N, rows that fail validation still pass to the target table.

Also note the AbortJob stage variable derivation uses the $ERR_THRESHOLD parameter to limit the number of error records allowed in the job. If the number of error records exceed the value set for the $ERR_THRESHOLD parameter, the job automatically aborts. For example, if $ERR_THRESHOLD is set to 50, the job aborts if the number of records with errors exceeds 50. You can set the value of the $ERR_THRESHOLD parameter to meet your specific business requirements.

Using the SetID lookup validation as an example, if a record fails validation, a Y value is assigned to the ErrorFoundSetID stage variable. If the $ERR_VALIDATE parameter is also set to Y, the failed record is sent to the PS_ES_CAL_BP_TBL error table.

Image: Example constraint

This example illustrates the Example constraint.

For records that pass validation, an N value is assigned to the ErrorFound stage variable and the records are sent to the target table.

Image: N value is assigned to the ErrorFound stage variable

N value is assigned to the ErrorFound stage variable

Disabling Data Validation

You can disable error validation in OWS jobs by configuring the value of the $ERR_VALIDATE parameter. By default the value is set to Y, which means that records failing validation are moved an error table. If you set the $ERR_VALIDATE value to N, records failing validation will still pass to the target table.

Understanding the Data Completeness Validation and Job Statistic Summary Mechanism

The ETL component of the new data validation feature can be found using the following navigation in the DataStage Designer repository window: Jobs, Data_Validation.

Image: Data_Validation jobs in the DataStage project tree

This example illustrates the the Data_Validation jobs in the DataStage project tree.

Data Validation - Job Summary Data

The ETL logic for the new data validation feature is contained in a reusable, common component called SEQ_J_Handle_DATA_VALIDATION, which is a sequencer job:

Image: SEQ_J_Handle_DATA_VALIDATION sequencer job

This example illustrates the SEQ_J_Handle_DATA_VALIDATION sequencer job.

Since the logic is reusable, the sequencer job is incorporated into all the existing sequencer jobs that load the MDW. Also, because the logic is contained within a sequencer job, there is no need to modify existing server jobs to implement the logic.

Here is the SEQ_J_Handle_DATA_VALIDATION sequencer job incorporated into an existing sequencer job for the Campus Solutions Warehouse, SEQ_J_Fact_PS_F_ADM_APPL_STAT:

Image: SEQ_J_Fact_PS_F_ADM_APPL_STAT sequencer job with the data validation component

SEQ_J_Fact_PS_F_ADM_APPL_STAT sequencer job with the data validation component.

Note: If you have customized your ETL jobs, you can implement the new data validation component simply by attaching it to the associated sequencer job.

The SEQ_J_Handle_DATA_VALIDATION sequencer job consists of the following server jobs:

J_Load_PS_DATA_VALIDATE_ERRTBL
J_Load_PS_DATA_VALIDATE_NOERRTBL
J_Load_DATAVALIDATE_NOTARGET
J_Load_PS_DATA_VALIDATE_ERRTBL_NOTARGET

A closer look at these server jobs shows that they load the Data Validation Summary (PS_DAT_VAL_SMRY_TBL) table:

Image: J_Load_PS_DATA_VALIDATE_NOERRTBL

J_Load_PS_DATA_VALIDATE_NOERRTBL

The Data Validation Summary (PS_DAT_VAL_SMRY_TBL) table consolidates all your job run statistic and error data, and is used as the foundation for the delivered data validation OBIEE reports (which are discussed in the PeopleSoft EPM Fusion Campus Solutions Intelligence for PeopleSoft documentation).

The following graphic and table provide a more detailed look at the columns included in the Data Validation Summary (PS_DAT_VAL_SMRY_TBL) table:

Image: PS_DAT_VAL_SMRY_TBL

This example illustrates the PS_DAT_VAL_SMRY_TBL.

Column	Description
JOB_NAME	Populated with the server job name for each job run. This is a composite Primary Key.
BATCH_SID	Contains the BATCH_SID for the corresponding job run. This is a composite Primary Key.
SRC_COUNT	Contains the row count from the source that was extracted by the job.
TGT_INS_COUNT	Contains the row count for rows inserted into the target table by the job.
TGT_UPD_COUNT	Contains the row count for rows updated in the target table by the job.
TGT_DEL_COUNT	Contains the row count for rows deleted from the target table by the job.
ERROR_ROWS	Contains the count of unique rows from the source that failed to load the target due to lookup validation failure. Will default to '-' when: Unique error count cannot be captured (for example, jobs having multiple error tables). Jobs use parent-child relationship tables, but the error data is captured at the parent table level only.
JOB_START_DTTM	Contains the server job start time. This will be the LASTUPD_EW_DTTM in the target tables.
JOB_END_DTTM	Contains the server job end time.
JOB_RUN_STATUS	Contains the run status of a job and whether the data has been loaded into the target completely. Job run status values are: Success: Indicates the ETL Job ran to success and all rows from the source are loaded to the target successfully. Hence the source count will match with sum of target rows. Success with Error Rows: Indicates the ETL Job ran to success but some rows are rejected to Error Table. In this case the source count will tally with sum of target rows and error rows, provided EXCEPTION_FLAG is set to 'N.' Success with Data Mismatch: Indicates the ETL Job ran to success but source count does not tally with the sum of target counts and no rows are inserted into error table. Warning: Indicates the ETL Job finished with warnings. In this case the source count might not tally with the sum of target rows and error rows. Failed: Indicates the ETL Job aborted. In this case the source count will not tally with the sum of target rows and error rows. Note: Even if data validation is disabled for a job, job run status is still captured so that you can use the delivered Job Run Statistics OBIEE report to monitor job status.
DATA_VAL_FLAG	Flag indicating whether the data validation flag is enabled or disabled for the server job.
ERROR_DETAILS	Contains the error table names and total count of rows loaded into each error table. When no rows are loaded into error table it will be defaulted to "-"
LASTUPDOPRID	Contains the DataStage user information who triggered or ran the job.
TARGET_TABLE	Contains the target table name to which data was loaded.

Data Validation - Job Statistic Data

The SEQ_J_Handle_DATA_VALIDATION sequencer job also includes a mechanism to capture job statistic data. Note that within the sequencer job is the routine GetJobCntrlInfo:

Image: GetJobCntrlInfo routine in the SEQ_J_Handle_DATA_VALIDATION sequencer job

This example illustrates the GetJobCntrlInfo routine in the SEQ_J_Handle_DATA_VALIDATION sequencer job.

The GetJobCntrlInfo routine retrieves job information (such as error table and exception flag) from the Data Validation Control (HASH_PS_DATVAL_CTRL_TBL) hashed file:

Image: GetJobCntrlInfo routine detail

This example illustrates the GetJobCntrlInfo routine detail.

The job information in the hashed file is also used by the server jobs that load the Data Validation Summary (PS_DAT_VAL_SMRY_TBL) table.

The Data Validation Control (PS_DATVAL_CTRL_TBL) table stores job statistic data for each OWS to MDW job run, such as source count, target count, error count, and error table list. The table is delivered prepopulated with the necessary data for the ETL jobs that perform data validation.

Note: If you have customized your ETL Jobs, you must manually update/insert data for the customized job in the PS_DATVAL_CTRL_TBL table and then run the J_Hash_PS_DATVAL_CTRL_TBL server job to update the related hashed file. See the following topic on how to perform these tasks.

The following graphic and table provide a more detailed look at the columns included in the Data Validation Control (PS_DATVAL_CTRL_TBL) table:

Image: PS_DATVAL_CTRL_TBL

This example illustrates the PS_DATVAL_CTRL_TBL.

Column	Description
JOB_NAME	Populated with the server job name that loads the OWS and MDW table. This is a primary key in this table.
SEQ_JOB_NAME	Populated with the Sequencer Name of the server job.
SOURCE_TBL_LIST	Contains the list of source table names used in the job.
TARGET_TBL_LIST	Contains the list of target target table names used in the job. This column may also contain parameter names when the target table is parameterized.
ERROR_TBL_LIST	Contains the list of error tables populated in the job.
TGT_DEL_STG_LNK	Identifies the active stage that precedes the DRS stage that performs a delete on target table, which is used to retrieve the target deleted row count.
SRC_ACT_STG_LNK	Identifies the active stage that succeeds the source DRS stage, which is used to retrieve the source row count.
ERR_TBL_QRY	Contains the error table query that is used to fetch the count of unique errors rows that failed to load the target.
EXCEPTION_FLAG	Contains 'Y' for jobs when source row count might not tally with the sum of target and error row counts (due to functional or job design). Otherwise, contains 'N.'
GEN_SQL_FLAG	Contains 'Y' if DRS stage uses generated sql query. Otherwise, contains 'N' if DRS stage uses user-defined SQL query.
JOB_CATEGORY	Contains the category (folder) information in the DataStage Project Repository for the job.
WH_NAME	Contains the EPM Warehouse name related to the job. The data validation component uses this value to select the appropriate parameter file at runtime.
MART_NAME	Contains the data mart name that corresponds to the parent EPM Warehouse, for the job run.
JOB_TYPE	Contains the job type information for the job. Job type values include: STAGING DIMENSION FACT
SUBJECT_NAME	Contains the functional area that corresponds to the parent data mart, for the job.
LASTUPDOPRID	Contains the user information associated with the insert or update actions, for a given job.
LASTUPDDTTM	Contains the timestamp when data was last modified.

Another data validation-job statistic table, the Data Validation Job Error (PS_DATVAL_JOB_ERR) table, is populated from the PS_DATVAL_CTRL_TBL by an in-house ETL job not delivered with DataStage metadata.

The Data Validation Job Error table is used in Error Table OBIEE reports and contains specific error table information and corresponding source table information for each job. The Data Validation Job Error table is populated from the PS_DATVAL_CTRL_TBL by an in-house ETL job not delivered with DataStage metadata.

The following graphic and table provide a more detailed look at the columns included in the Data Validation Job Error (PS_DATVAL_JOB_ERR) table:

Image: PS_DATVAL_JOB_ERR

This example illustrates the PS_DATVAL_JOB_ERR.

Column	Description
SEQ_NBR	Contains the unique sequence number for each error table.
JOB_NAME	Populated with the server job name for each job run.
ERROR_TABLE	Contains the error table name related to the job.
SOURCE_TABLE	Contains the name of the source table used with a corresponding error table.
LASTUPDOPRID	Contains the user information associated with the insert or update actions, for a given job.

Enabling or Disabling the Data Completion Validation Feature

The Data Validation and Error Reporting feature is designed as optional, and you can enable it or disable it for each staging job using the delivered parameter files. For example, you can access the HCM_DATA_VALIDATION_SETUP parameter file and change the DATA_VALIDATION value from 'N' (do not perform data validation) to 'Y' (perform data validation) for each job:

Image: Campus Solutions Warehouse Data Validation parameter file

This example illustrates the Campus Solutions Warehouse Data Validation parameter file.

PeopleSoft delivers a separate parameter file for each EPM Warehouse and for Global/Common dimension jobs:

CS_DATA_VALIDATION_SETUP.txt
FMS_DATA_VALIDATION_SETUP.txt
HCM_DATA_VALIDATION_SETUP.txt
GLOBAL_DATA_VALIDATION_SETUP.txt

Implementing the Data Completeness Validation Feature for Customized Jobs

PeopleSoft provides the DDL Scripts and related data files for the Data Completeness Validation setup tables, but if you have created new jobs or customized delivered jobs, you must perform the following steps to implement the Data Completion Validation feature:

Populate the PS_DATVAL_CTRL_TBL.

For each customized job, you must populate this table using an SQL INSERT statement. Use the following table as a guide to populating the PS_DATVAL_CTRL_TBL fields:

Column	Description
JOB_NAME	Populate this field with the server job that loads the OWS and MDW table.
SEQ_JOB_NAME	Populate this field with the Sequencer job that triggers the customized job.
SOURCE_TBL_LIST	Populate this field with source tables used in the customized job. Use commas to separate multiple table names.
TARGET_TBL_LIST	Populate this field with target tables used in the customized job. If target table name is a job parameter, use this sample format: #TargetTable#(PS_FF_STDNT_ENRL/PS_TF_STDNT_ENRL), where TargetTable is the Job Parameter
ERROR_TBL_LIST	Populate this field with error tables used in the customized job.
TGT_DEL_STG_LNK	Populate this field with the active stage name that precedes the DRS stage that performs a delete on the target table. If the active stage contains more than one output link, the link name should also be specified using this format: <active_stage_name>@<link_name> where link_name is the link connecting to the Delete DRS stage and active_stage_name is the transformer name. For example: "PS_R_STDNT_ATTR_split@IPC_Del_Out*-" is value for this field in J_Dim_PS_R_STDNT_ATTR job
SRC_ACT_STG_LNK	For CRC staging jobs, populate this field with the active stage name and link name that writes to the Delta hash file. Use the following format: <active_stage_name>@<link_name> where link_name is the link that updates the delta hash file in the CRC job. For non CRC staging jobs, populate this field with the active stage name that follows the source DRS stage in the customized job.
ERR_TBL_QRY	Populate this field with the error table query that is used to fetch the count of unique errors rows that failed to load the target. For customized jobs having one error table, use the following query syntax: SELECT DISTINCT <BUSINESS KEYS of source table> FROM #$ERROR_TBL_SCHEMA#<ERROR TABLE NAME>Sample: SELECT DISTINCT SRC_SYS_ID,EFFSEQ,EFFDT,EMPL_RCD,EMPLID FROM #$ERROR_TBL_SCHEMA#PS_E_JOB This field is defaulted to "-" for customized jobs with more than one error table.
EXCEPTION_FLAG	Set this flag to "Y" for customized jobs with more than one error table. Set this flag to "Y" for customized jobs that include multiple source tables with parent/chid relationships and the error data is captured only at parent table level.
GEN_SQL_FLAG	Set this flag to “Y” for customized jobs that use generated SQL in the Source DRS stage. Set this flag to “N” when the source DRS stage includes user-defined SQL.
JOB_CATEGORY	Populate this field with the category (folder) name where the job is placed within the DataStage Project Repository.
WH_NAME	Populate this field with the EPM Warehouse name related to the customized job.
MART_NAME	Populate this field with the data mart name that corresponds to the customized job.
JOB_TYPE	Populate this field with the job type information for the customized job. Job type values include: STAGING DIMENSION FACT
SUBJECT_NAME	Populate this field with the functional area that corresponds to the customized job.
LASTUPDOPRID	Populate this field with the user ID associated with the insert or update actions for the customized job.
LASTUPDDTTM	Populate this field with the current timestamp representing when data was last modified for the customized job.

Here is a sample of SQL INSERT syntax:

  INSERT INTO "PS_DATVAL_CTRL_TBL"
   ( 
    JOB_NAME, SOURCE_TBL_LIST, TARGET_TBL_LIST, 
    ERROR_TBL_LIST, TGT_DEL_STG_LNK, SRC_ACT_STG_LNK,
    ERR_TBL_QRY,
    EXCEPTION_FLAG, GEN_SQL_FLAG, WH_NAME, MART_NAME,
    SUBJECT_NAME, JOB_TYPE, JOB_CATEGORY,
    LASTUPDOPRID, LASTUPDDTTM, SEQ_JOB_NAME
   )
   VALUES
   (
    'J_Fact_PS_F_KK_EXCEPTION','PS_KK_EXCPTN_TBL','PS_F_KK_EXCEPTION',
    'PS_E_KK_EXCEPTION','Trans_Assign_Values8@Trans_Assign_Values8_out_DEL*-',
    'IPC_SRC.IDENT1',
    'SELECT DISTINCT
    SRC_SYS_ID,EXCPTN_TYPE,LEDGER_GROUP,KK_TRAN_LN,KK_TRAN_DT,KK_TRAN_ID
    FROM #$ERROR_TBL_SCHEMA#PS_E_KK_EXCEPTION',
    'N','Y','FMS','GL_And_Profitability_Mart',
    'Commitment_Control','FACT',
    'Jobs\FMS_E\GL_And_Profitability_Mart\Commitment_Control\OWS_To_MDW\Facts\Base\Lo    ad_Tables\Server',
    'EPM',to_timestamp('01-JAN-53 12.00.00.000000000 AM','DD-MON-RR HH.MI.SS.FF AM'),    'SEQ_J_Fact_PS_F_KK_EXCEPTION'
   );

Add an entry in Data Validation Parameter File.

For warehouse specific Jobs, add an entry in the corresponding parameter file. For example, customized jobs specific to Campus Solutions require an entry in the CS_DATA_VALIDATION_SETUP.txt and you could use the following format:
```
[J_Dim_PS_D_ACAD_CAR]
DATA_VALIDATION = N
```
For Common Dimensions and Global Dimensions, add an entry in GLOBAL_DATA_VALIDATION_SETUP.txt.

To enable data load validation (capturing source, target and error row counts for each run), the DATA_VALIDATION flag should be set to ‘Y’.
Run J_Hash_PS_DATVAL_CTRL_TBL job.

Note: It is important that the information for a customized server job is accurately reflected in the PS_DATVAL_CTRL_TBL before running this job.
Attach the common component, SEQ_J_Handle_DATA_VALIDATION, sequencer job to the existing sequencer job that corresponds to the customized server job.

Copy the SEQ_J_Handle_DATA_VALIDATION component and paste it into the existing sequencer job that corresponds to the customized server job. Reestablish the links in the new job sequence, map the Job Name and BatchSID parameter in the Job Activity, and verify that hash file directory path, parameter file directory path, and the database related parameters are mapped correctly. Then compile the job.

When you complete these steps, the PS_DATVAL_SMRY_TBL captures job run statistics for each job run by the data validation component.

Populate the PS_DATVAL_JOB_ERR.

For each customized job, you must populate this table using an SQL INSERT statement. Use the following table as a guide to populating the PS_DATVAL_JOB_ERR fields:

Column	Description
SEQ_NBR	Populate this field with a unique sequence number for each error table. To avoid overlapping SEQ_NBR, use higher SEQ_NBR (values greater than 999999).
JOB_NAME	Populate this field with the customized server job name.
ERROR_TABLE	Populate this field with the error table name used in the customized job.
SOURCE_TABLE	Populate this field with the name of the source table used with the corresponding error table.
LASTUPDOPRID	Populate this field with the user ID associated with the insert or update actions for the customized job.

Here is a sample of SQL INSERT syntax:

INSERT INTO "PS_DATVAL_JOB_ERR" 
(SEQ_NBR,JOB_NAME,ERROR_TABLE,SOURCE_TABLE,LASTUPDOPRID)
VALUES
(236,'J_Dim_PS_D_GM_SPONSOR','PS_E_GM_CUSTOMER','PS_GM_CUSTOMER','dsuser');

Populate PS_DATVAL_SRC_TBL.

For each customized job, you must populate this table using an SQL INSERT statement. Use the following table as a guide to populating the PS_DATVAL_SRC_TBL:

Column

Description

SEQ_NBR

Populate this field with a unique sequence number for each error table.

To avoid overlapping SEQ_NBR, use higher SEQ_NBR (values greater than 999999).

ERROR_TABLE

Populate this field with the error table name used in the customized job.

SOURCE_TABLE

Populate this field with the name of the source table used with the corresponding error table.

Here is a sample of SQL INSERT syntax:

INSERT INTO "PS_DATVAL_SRC_TBL" 
(SEQ_NBR,ERROR_TABLE,SOURCE_TABLE)
VALUES
(351,'PS_E_WA_COMP_PIN','PS_WA_COMP_PIN_MAP');