Oracle® Health Sciences Data Management Workbench User's Guide
Release 2.4

E52292-02

9 Data Processing

To run transformations and validation checks, go to the Home page. To load data, go to the Study Configuration/Clinical Data Models page in the InForm or File Watcher tab.

Data Processing Types and Modes

Oracle Health Sciences Data Management Workbench (Oracle DMW) supports two types of data processing: Reload Processing and Unit of Work Processing (UOW). Both types can be run in either Full or Incremental mode. UOW processing has an additional mode—UOW Load—that is available only for loading data.

The definition of the target table determines the default data processing type—Reload or UOW—that the system uses for a job. You select the mode—Full, Incremental, or Load—when you schedule or submit a job.

Both types of processing require a primary key on the target Table instance. It is not necessary to have a primary or unique key defined in an external source. For example, external SAS files that do not have a primary key defined can be loaded as long as the target table has a primary key.

The target tables of validation checks, which are created by the system, use Reload processing.

Reload Processing

In Reload processing, the system processes all records in the source tables or table-level files. It compares the primary keys of the source records with the primary keys of any records already in the target tables to determine which records to insert, update, and (in Full mode only) delete. If a reloaded record does not include any data changes, the system simply changes its refresh timestamp to the timestamp of the current job.
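The comparison logic can be sketched as follows. This is an illustrative Python sketch of the behavior described above, not Oracle DMW's internal implementation; the function and parameter names are invented for the example.

```python
def reload_diff(source, target, full_mode):
    """Classify records by primary key for Reload processing.

    source, target: dicts mapping primary-key tuples to record data.
    Returns keys to insert, update, refresh, and (Full mode only) delete.
    """
    inserts = [k for k in source if k not in target]
    updates = [k for k in source if k in target and source[k] != target[k]]
    # Unchanged reloaded records only get a new refresh timestamp.
    refreshes = [k for k in source if k in target and source[k] == target[k]]
    # Deletions (soft deletes) happen only in Full mode.
    deletes = [k for k in target if k not in source] if full_mode else []
    return inserts, updates, refreshes, deletes
```

In Incremental mode the same comparison runs but the delete list stays empty, which matches the mode descriptions that follow.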

Full

Full Reload processing performs insertions, updates, refreshes, and deletions: if a record is not included as input but its table is included, the system soft-deletes the record—the record still exists in the database but is no longer available for use. The audit trail for the record is available.

If an entire table is not included in a data load, full reload does not delete the data in the target table. To delete all data in a target table, load an empty file for the table.

Note:

Unless you intend to delete the data in a particular table, use Full mode only when you are confident that the data being loaded or read is the complete set of current data.

Incremental

Incremental Reload processing performs insertions, updates, and refreshes but no deletions. If a record is not reloaded it remains available but its timestamp is not updated.

Incremental processing is faster than full, so you may want to use incremental processing frequently and full processing less frequently but regularly, to ensure that data is appropriately deleted.

Unit of Work Processing

A Unit of Work (UOW) is the set of all records associated with either a particular subject or a particular subject visit. Tables with a UOW processing type must have the subject column, or the subject and visit columns, marked as SDTM identifiers, and the Unit of Work must be defined as either Subject or Subject/Visit.

UOW processing reads all source records and notes the UOW key (the value of the subject column, or of the subject and visit columns) for each record, adding each key to an execution set: the set of subjects or subject visits to be processed. The system then processes only the records belonging to units of work in the execution set and no records for other units.
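Building the execution set can be sketched like this. The sketch is illustrative only; the column names "subject" and "visit" are assumptions standing in for the SDTM-identified columns.

```python
def build_execution_set(source_records, uow="subject"):
    """Collect the UOW keys seen in the incoming records; only these
    units of work (subjects or subject visits) will be processed."""
    if uow == "subject":
        return {(r["subject"],) for r in source_records}
    # Subject/Visit UOW: the key is the subject-visit pair.
    return {(r["subject"], r["visit"]) for r in source_records}
```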

See Table 9-1, "Deletion Behavior in Unit of Work Processing Modes" to compare UOW modes.

Target tables defined for Unit of Work processing also accept Reload processing.

Full UOW

Full UOW processing examines timestamps in the source UOW tables to determine which records have changed since the last Full UOW or Full Reload processing job, and builds the execution set from those records' UOW keys (subjects or subject visits) and no others. If the program has never been run in any mode, all units of work are included in the execution set. The job inserts, updates, and refreshes all records belonging to subjects or subject visits in the execution set and no others. If records previously existed in the target tables, it also deletes records:

  • Records in processed units of work (subject or subject visit) that are not reloaded are deleted.

  • Entire units of work are deleted if they exist in the target table but are not reloaded.

    For example, suppose a transformation reads from an Adverse Events table, writes records whose Serious flag is set to Y to a Severe Adverse Events table, and has previously inserted such records for a particular subject. If the Serious flag changes to N on all of that subject's records, no records are inserted and, because the subject is in the UOW execution set, all of that subject's records are deleted from the target.
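The Full UOW insert, update, and delete behavior described by the bullets above can be sketched as follows. This is a hypothetical model, not the product's implementation; in the real system all deletions are soft deletions with an audit trail.

```python
def full_uow_apply(source, target, execution_set, uow_key):
    """Full UOW: within units in the execution set, apply reloaded records
    and drop any target record that was not reloaded; records in other
    units of work are left untouched."""
    new_target = {}
    source_keys = set(source)
    for pk, rec in target.items():
        if uow_key(pk) in execution_set and pk not in source_keys:
            continue  # not reloaded within a processed unit: deleted
        new_target[pk] = rec
    for pk, rec in source.items():
        new_target[pk] = rec  # insert or update
    return new_target
```

Note how subject B below is untouched because its unit of work is not in the execution set, while subject A's non-reloaded record is deleted.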

Incremental UOW

Incremental UOW processing examines records' timestamps in the source UOW tables to determine which records have changed since the last processing job in any mode (UOW or Reload, Full or Incremental), and builds the execution set from those records' UOW keys (subjects or subject visits) and no others. If the program has never been run in any mode, all units of work are included in the execution set. The job performs insertions, updates, and refreshes but no deletions.

As with Reload processing, UOW's Incremental mode is faster than Full mode, so you may want to use incremental UOW processing frequently and full UOW processing less frequently but regularly, to ensure that data is appropriately deleted.

Note:

The end result of Incremental UOW processing is the same as Incremental Reload processing. Both process all new and changed records and delete no records. The end result of the two Full modes is also the same.

Which process will be faster depends on the volume of changed data being processed and whether changes are concentrated in specific units of work or spread fairly evenly across all units. Compared to Reload, UOW processing has overhead costs in detecting affected subjects or subject/visits, but it is more efficient in that it processes records only in units with changes, not all records.

In general, UOW will probably be faster than Reload when the number of incremental changes is small or concentrated in relatively few units of work.

In addition, if you use custom programs in transformations, UOW processing takes care of finding incremental data changes; your code does not need to handle that.

UOW Load

After loading data, the system identifies the distinct set of UOW keys for the records that were inserted, modified, or refreshed in the load, creates an execution set consisting of those units of work, and processes all records within those units and no others. Any record that belongs to a unit of work in the execution set but is not included in the file is deleted.

This is the only type of UOW processing available during data loading. If you want different deletion behavior, you can use Reload processing; see Table 9-2, "Deletion Behavior in Data Loading Processing Modes".

Note:

Use UOW Load mode only if the file being loaded contains the full current set of data (new, modified, and unchanged) for every subject or subject visit with any new or modified data, because any data not reloaded in processed units of work is deleted.

Instead, use Incremental Reload processing to load data containing just new or modified records.

Table 9-1 Deletion Behavior in Unit of Work Processing Modes

Unit of Work Mode | Delete within Reloaded Unit? | Delete All Records for Nonreloaded Unit?
UOW Load          | Yes                          | No
Full UOW          | Yes                          | Yes
Incremental UOW   | No                           | No


Data Processing in Transformation Programs

A transformation reads from one or more source data models and writes to a single target data model. The transformation consists of a separate program for each target table. By default, each program uses the processing type of the target table—Reload or UOW. The user who submits or schedules the transformation to run chooses either Full or Incremental mode.

The system uses the default processing type in the selected mode for each target table whenever possible. However, if the target table's processing type is UOW, the system may not be able to use it:

  • Any source tables that are not themselves identified as UOW tables are processed using Reload because the system cannot determine which subjects or subject visits may have been impacted by the changes. The end result is the same, but processing may be slower.

  • Similarly, if the target is defined with Subject Visit UOW and a source table is defined as Subject UOW:

    • In Subject Visit UOW source tables, the transformation logic processes all records for impacted subject visits, as well as the records for any subjects impacted in the Subject UOW source tables.

    • In Subject UOW source tables, the transformation logic processes all records for the impacted subjects.

Enable UOW processing wherever possible by defining tables as UOW: when a table has subject and visit columns, designate those columns with the corresponding SDTM identifiers and set the processing type to UOW Subject Visit (or Subject, if there is no visit column). Do this even if you do not plan to use UOW processing to write to the table, because it enhances downstream processing performance.

If a source table is not a subject-oriented table, such as a lookup or reference table, and there is any change in data in it since the last execution of the program, the system uses Full Reload processing on all source tables.

Populating Surrogate Keys for Data Lineage Tracing

To enable the system to trace the lineage of every data point back to its source, all types and modes of data processing do the following the first time a transformation is run:

  • Create one additional surrogate key (SKEY) column in the target table for each source table; if two source tables feed into a target table, the transformation logic adds two SKEY columns to the target table.

  • Populate each SKEY column with the surrogate key value of the record in the source table. That value is the concatenated primary key values of the record in the source table, separated by pipes—for example:

    SUBJID_VALUE|VISINUM_VALUE|TEST_VALUE
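The pipe concatenation shown above can be sketched in Python. This is illustrative only; the column names are examples, and the real system builds the value internally.

```python
def surrogate_key(record, primary_key_cols):
    """Concatenate a record's primary key values, separated by pipes,
    in the manner of the SKEY example above."""
    return "|".join(str(record[c]) for c in primary_key_cols)
```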
    

See "How the System Tracks Data Lineage" for more information.

Automatically Triggering Transformations and Validation Checks by Upstream Processes

You can set up a chain of transformation and validation check executions beginning when you load data into the system or at any point thereafter by setting the Can Trigger attribute of a source model in a transformation to Yes. This ensures that the data in all participating models is always as up-to-date as possible.

Example

A study has Input models called InForm, Central Lab and Local Lab that all feed data to the Review model, which feeds data to the Analysis model, which feeds data into the Submission model.

In the Review model's transformation, for which all Input models are source models, set the Can Trigger attribute to Yes for each Input model. Whenever data is loaded into any of these Input models, the transformation to the Review model is triggered, as are all the validation check batches defined in the Review model.

In the Analysis model's transformation, which has only the Review model as a source in this example, set Can Trigger to Yes for Review as a source model. Whenever data is loaded into the Review model—as the result of a data load to any of its source models or because of a manual or scheduled execution—the transformation to the Analysis model is triggered, as well as all validation check batches defined in the Analysis model.

In the Submission model's transformation, you may prefer not to set the source model's Can Trigger attribute to Yes. It may not be important to have the latest data as soon as possible if the purpose of this model is to export final data for regulatory submission, and you may prefer not to use system resources for frequent executions.
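The cascade in this example can be modeled as a breadth-first walk over Can Trigger relationships. This is a hypothetical sketch of the triggering chain, not the product's internal scheduler; the model names follow the example above.

```python
def cascade(loaded_model, triggers):
    """Return the models whose transformations run, in order, after data
    lands in loaded_model. `triggers` maps a source model to the target
    models whose transformations it can trigger (Can Trigger = Yes)."""
    order, queue, seen = [], [loaded_model], {loaded_model}
    while queue:
        model = queue.pop(0)
        for target in triggers.get(model, []):
            if target not in seen:
                seen.add(target)
                order.append(target)  # this model's transformation (and
                queue.append(target)  # validation checks) now runs
    return order
```

With Can Trigger set to Yes from the Input models into Review and from Review into Analysis, but not into Submission, a load into InForm runs the Review and Analysis transformations and stops there.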

Data Processing in Validation Check Programs

The target tables of validation checks, which are created by the system, use Reload Processing. The user can choose Full or Incremental mode when scheduling or submitting the validation check batch.

See also "Automatically Triggering Transformations and Validation Checks by Upstream Processes".

Loading Data

The system supports loading data from SAS or text files, or from InForm.

See also "Format Checks on Loaded Files".

You set up data loading when you define input clinical data models.

Processing Data Loads from Files

You can load text or SAS files (data sets, XPORT, or CPORT files) into an input data model. You set up a File Watcher for each input data model. The File Watcher then detects when a data file appears in a specified location and proceeds to load the data. See "Configuring File Watcher" for more information.

The system supports three processing modes for loading data from files:

  • Full (Reload)

  • Incremental (Reload)

  • UOW Load

    Note:

    Do not use Full mode if the table-level file being loaded contains only new data or a subset of data because any data not reloaded will be deleted.

    Do not use UOW Load mode if the table-level file being loaded contains only new data or a subset of data for subjects or subject visits because any data not reloaded in processed units of work will be deleted. Instead, use Incremental (Reload) processing.

    Note:

    All deletions are "soft" deletions: records have an end timestamp equal to the load's date and time and are no longer available in the system. However, they still exist in the database and have an audit trail.

Table 9-2 Deletion Behavior in Data Loading Processing Modes

Processing Mode    | Uses UOW Logic? | Deletion Behavior
UOW Load           | Yes             | Deletes nonreloaded records within units of work (subjects or subject visits) that have new, changed, or refreshed records.
Incremental Reload | No              | Does not delete any data.
Full Reload        | No              | Deletes all records that are not reloaded in tables that are reloaded.


Processing InForm Data Loads

Each InForm study has one InForm input clinical data model for each lifecycle stage. You set up a connection and schedule. See "Configuring the Oracle Health Sciences InForm Connector" for more information.

Format Checks on Loaded Files

The system uses Oracle SQL*Loader to load SAS and text files in both Reload and Unit of Work modes, making use of its ability to test that incoming data meets the format requirements of target columns, including data type, length, code list values (if the column is associated with a code list), and the nullable and check constraints.

Records that pass the format checks are inserted into the table one record at a time using an Oracle INSERT statement. Records that are rejected are captured and validated for other errors. All the errors found in each record are reported in a file called ComprehensiveErrRpt.csv.

For each record with an error, the error file contains one row with "ORIGINAL_ERROR" in the Column Name column, another row identifying the column that caused the original error, and additional rows for any other errors found in the same record. Even though the record is rejected after the first error, the system continues to check all column values for the record and lists every error in the error file.

Table 9-3, "Error File Example" shows the error file entries for two records. The first record has two errors and the second has one. The string "CDR_W37_1D0156B9.TXT_TV486627301.Name" represents the full path of the erroring field, Name, for Record 1. The first part, "CDR_W37_1D0156B9," is the schema name and the second, "TXT_TV486627301," is the view based on the target table.

Records with errors are not inserted into the target tables.

When you configure File Watcher, you can define the maximum number of rejected records allowed for the load before the load fails using the Max Errors parameter.
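The row-per-error layout of ComprehensiveErrRpt.csv lends itself to simple post-processing, sketched below. This assumes the columns shown in Table 9-3 appear as a CSV header row; the exact header names in a real report may differ.

```python
import csv
from collections import defaultdict

def errors_by_record(csv_path):
    """Group ComprehensiveErrRpt.csv rows by (table, file, record number),
    so all errors for one rejected record can be reviewed together."""
    grouped = defaultdict(list)
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            key = (row["TABLE_NAME"], row["FILE_NAME"], row["REC_NUM"])
            grouped[key].append((row["COLUMN_NAME"], row["ERROR_MESSAGE"]))
    return grouped
```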

Table 9-3 Error File Example

TABLE_NAME | FILE_NAME | REC_NUM | COLUMN_NAME    | VALUE                        | ERROR_MESSAGE
DEMOG      | DEMOG.txt | 1       | ORIGINAL_ERROR | Captured in the TextLoad.log | ORA-20100: ORA-01400: cannot insert NULL into ("CDR_W37_1D0156B9.TXT_TV486627301.NAME")
DEMOG      | DEMOG.txt | 1       | NAME           |                              | ORA-01400: cannot insert NULL into ("CDR_W37_1D0156B9.TXT_TV486627301.NAME")
DEMOG      | DEMOG.txt | 1       | HT             |                              | ORA-01400: cannot insert NULL into ("CDR_W37_1D0156B9.TXT_TV486627301.HT")
DEMOG      | DEMOG.txt | 32      | ORIGINAL_ERROR | Captured in the TextLoad.log | ORA-20100: ORA-02290: check constraint ("CDR_W37_1D0156B9.TXT_TV486627301.AGE_CK") violated
DEMOG      | DEMOG.txt | 32      | AGE            | 05                           | ORA-02290: check constraint ("CDR_W37_1D0156B9.TXT_TV486627301.AGE_CK") violated

Supporting Duplicate Primary Keys in a Load

In rare cases you may need to allow a single load of source data to contain records with duplicate primary key values—for example, when loading data from a small lab that may not be able to guarantee uniqueness. In those cases you can define a composite key, but it may not be sufficient to ensure uniqueness.

If you need to support duplicate primary key values within a single data load, check Supports Duplicate when you define the primary key for the target table in the input data model. Selecting this option ensures that all records are loaded and not deleted but requires careful checking of the data.

When you select Supports Duplicate, the system adds the column CDR$DUP_NUM to the target table. During each data load, the system detects whether multiple incoming records have an identical primary key value and:

  • The system inserts a value of 1 into the CDR$DUP_NUM column for the first occurrence of each set of primary key values.

  • If another record with the same primary key values is loaded in the same data load, the system inserts a value of 2 into its CDR$DUP_NUM column, and 3 for the third record with the same primary key values, and so on.

  • During subsequent data loads, the system assumes that the first record with a particular set of primary key values is the same as the existing one with the CDR$DUP_NUM value 1, the second is the same as 2, and so on. If two records exist with values 1 and 2, and the next load contains three records with the same values, the system gives the third record a CDR$DUP_NUM value of 3.

    Note:

    For this system to work, the lab must always reload all records in the same order, adding new records to the end of the file.

The CDR$DUP_NUM value becomes part of the surrogate key as well as the primary key and is used for data lineage tracing; see "Populating Surrogate Keys for Data Lineage Tracing".
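The duplicate-numbering scheme above can be sketched as follows. This is an illustrative model, not the product's implementation; it shows why record order matters, since the Nth occurrence of a key in one load is matched to the Nth occurrence in the next.

```python
from collections import defaultdict

def assign_dup_numbers(records, pk_cols):
    """Number duplicate primary keys in load order: 1 for the first
    occurrence of a key, 2 for the second, and so on (CDR$DUP_NUM)."""
    counts = defaultdict(int)
    out = []
    for rec in records:
        key = tuple(rec[c] for c in pk_cols)
        counts[key] += 1  # occurrence number within this load
        out.append({**rec, "CDR$DUP_NUM": counts[key]})
    return out
```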