Oracle® Health Sciences Data Management Workbench User's Guide
Release 2.3.1

Part Number E35217-02

9 Data Processing

This section contains the following topics:

  • Data Processing Types and Modes

  • Data Processing in Transformation Programs

  • Populating Surrogate Keys for Data Lineage Tracing

  • Data Processing in Validation Check Programs

  • Loading Data

To run transformations and validation checks, go to the Activities page. To load data, go to the Study Configuration/Clinical Data Models page in the InForm or File Watcher tab.

Data Processing Types and Modes

Oracle Health Sciences Data Management Workbench (Oracle DMW) supports two types of data processing: Reload Processing and Unit of Work Processing (UOW). Both types can be run in either Full or Incremental mode. UOW processing has an additional mode—UOW Load—that is available only for loading data.

The definition of the target table determines the default data processing type—Reload or UOW—that the system uses for a job. You select the mode—Full, Incremental, or Load—when you schedule or submit a job.

Both types of processing require a primary key on the target Table instance. It is not necessary to have a primary or unique key defined in an external source; for example, external SAS files that do not have a primary key defined can be loaded as long as the target table has a primary key.

The target tables of validation checks, which are created automatically by the system, use Reload processing.

Reload Processing

In Reload processing, the system processes all records in the source(s) and compares the primary keys of the source records with the primary keys of any records already in the target table(s) to determine which records to insert, update, and (in Full mode only) delete. If a reloaded record does not include any data changes, the system simply updates its refresh timestamp to the timestamp of the current job.
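The comparison logic can be sketched as a primary key diff between source and target. The function below is a simplified illustration (not Oracle DMW's actual implementation), modeling each data set as a dict keyed by primary key:

```python
def classify_reload(source, target, full_mode):
    """Classify records for Reload processing by comparing primary keys.

    source/target map primary key -> record data. Hypothetical sketch,
    not the actual Oracle DMW implementation.
    """
    inserts = [pk for pk in source if pk not in target]
    updates = [pk for pk in source if pk in target and source[pk] != target[pk]]
    # Unchanged reloaded records only get their refresh timestamp updated.
    refreshes = [pk for pk in source if pk in target and source[pk] == target[pk]]
    # Full mode soft-deletes target records that were not reloaded.
    deletes = [pk for pk in target if pk not in source] if full_mode else []
    return inserts, updates, refreshes, deletes
```

Note that in Incremental mode the delete list is always empty, which matches the Incremental Reload behavior described below.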

Full

Full Reload processing performs insertions, updates, refreshes, and deletions: if a record is not included as input, the system soft-deletes it—the record still exists but is no longer available for use. The audit trail for the record is available.

Note:

Do not use Full mode if the source contains only new data or any other subset of data because any data not reloaded will be deleted.

Use Full mode only if you are confident that the data being loaded or read is the complete set of data.

Incremental

Incremental Reload processing performs insertions, updates, and refreshes but no deletions. If a record is not reloaded it remains available but its timestamp is not updated.

Incremental processing is faster than full, so you may want to use incremental processing frequently and full processing less frequently but regularly, to ensure that data is appropriately deleted.

Unit of Work Processing

A Unit of Work (UOW) is the set of all records associated with either a particular subject or a particular subject visit. Tables with a UOW processing type must have the subject column, or the subject and visit columns, marked as SDTM identifiers, and the Unit of Work must be defined as either Subject or Subject/Visit.

UOW processing reads all source records, notes the UOW key (the value of the subject or subject and visit columns) for each record, and adds each UOW key to an execution set: the set of subjects or subject visits to be processed. The system then processes only records for the units of work represented in the execution set and no records for other units.
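The execution-set idea can be illustrated with a short sketch. The column names SUBJID and VISITNUM are assumptions for the example, not required names:

```python
def build_execution_set(changed_records, uow="subject"):
    """Collect the UOW keys of changed records into an execution set.

    Simplified illustration of the execution-set concept; SUBJID and
    VISITNUM are hypothetical column names.
    """
    keys = set()
    for rec in changed_records:
        if uow == "subject":
            keys.add(rec["SUBJID"])
        else:  # subject/visit unit of work
            keys.add((rec["SUBJID"], rec["VISITNUM"]))
    return keys

def in_execution_set(record, exec_set, uow="subject"):
    """True if a record belongs to a unit of work in the execution set."""
    key = record["SUBJID"] if uow == "subject" else (record["SUBJID"], record["VISITNUM"])
    return key in exec_set
```

A processing pass would then visit every source record but act only on those for which `in_execution_set` is true.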

See Table 9-1, "Deletion Behavior in Unit of Work Processing Modes" to compare UOW modes.

Target tables defined for Unit of Work processing also accept Reload processing.

Full UOW

Full UOW processing examines timestamps in the source UOW table(s) to determine which records have changed since the last Full UOW or Full Reload processing job, and creates the execution set from those records' UOW keys (subject or subject visit) and no others. If the program has never been run in any mode, all units of work are included in the execution set. The job inserts, updates, and refreshes all records belonging to subjects or subject visits in the execution set and no others. If records previously existed in the target tables, it also deletes records:

  • Records in processed units of work (subject or subject visit) that are not reloaded are deleted.

  • Entire units of work are deleted if they exist in the target table but are not reloaded.

    For example, suppose a transformation reads from an Adverse Events table, writes to a Severe Adverse Events table, and has previously inserted records with the Serious flag set to Y for a particular subject. If the Serious flag is changed to N on all of that subject's records, no records are inserted and, because the subject is in the UOW execution set, all records for that subject are deleted from the target.

Incremental UOW

Incremental UOW processing examines records' timestamps in the source UOW table(s) to determine which records have changed since the last processing job in any mode (UOW or Reload, Full or Incremental), and creates the execution set for those records' UOW key (subject or subject visit) and no others. If the program has never been run in any mode, all units of work are included in the execution set. It performs insertions, updates, and refreshes but no deletions.

As with Reload processing, UOW's Incremental mode is faster than Full mode, so you may want to use incremental UOW processing frequently and full UOW processing less frequently but regularly, to ensure that data is appropriately deleted.

Note:

The end result of Incremental UOW processing is the same as Incremental Reload processing. Both process all new and changed records and delete no records. The end result of the two Full modes is also the same.

Which process will be faster depends on the volume of changed data being processed and whether changes are concentrated in specific units of work or spread fairly evenly across all units. Compared to Reload, UOW processing has overhead costs in detecting affected subjects or subject/visits, but it is more efficient in that it processes records only in units with changes, not all records.

In general, UOW will probably be faster than Reload when the number of incremental changes is small or concentrated in relatively few units of work.

In addition, if you use custom programs in transformations, using UOW processing takes care of finding incremental data changes; your code does not need to handle that.

UOW Load

After loading data, the system identifies the distinct set of UOW keys for the records that were inserted, modified, or refreshed in the load, creates an execution set consisting of those units of work, and processes all records within those units and no others. Any records in a unit of work included in the execution set that are not included in the file are deleted.

This is the only type of UOW processing available during data loading. If you want different deletion behavior, you can use Reload processing; see Table 9-2, "Deletion Behavior in Data Loading Processing Modes".

Note:

Use UOW Load mode only if the file being loaded contains the full current set of data (new, modified and unchanged) for subjects or subject visits with any new or modified data because any data not reloaded in processed units of work will be deleted.

Instead, use Incremental Reload processing to load data containing just new or modified records.
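The UOW Load deletion rule can be sketched as follows. This is an illustration of the behavior described above, not product code; the `uow_key` function is a hypothetical helper that extracts the unit-of-work key (for example, the subject) from a primary key:

```python
def uow_load(loaded, target, uow_key):
    """Sketch of UOW Load: units of work touched by the load keep only
    reloaded records; units not touched by the load are left alone.

    loaded/target map primary key -> record data.
    """
    touched_units = {uow_key(pk) for pk in loaded}
    result = dict(loaded)  # everything in the load is inserted/updated/refreshed
    for pk, rec in target.items():
        if uow_key(pk) not in touched_units:
            result[pk] = rec  # unit not in the load: record survives
        # nonreloaded records in touched units are (soft-)deleted
    return result
```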

Table 9-1 Deletion Behavior in Unit of Work Processing Modes

Unit of Work Mode   Delete within Reloaded Unit?   Delete All Records for Nonreloaded Unit?
UOW Load            Yes                            No
Full UOW            Yes                            Yes
Incremental UOW     No                             No


Data Processing in Transformation Programs

A transformation reads from one or more source data models and writes to a single target data model. The transformation consists of a separate program for each target table. By default, each program uses the processing type of the target table—Reload or UOW. The user who submits or schedules the transformation to run chooses either Full or Incremental mode.

The system uses the default processing type in the selected mode for each target table whenever possible. However, if the target table's processing type is UOW, the system may not be able to use it:

  • Any source tables that are not themselves identified as UOW tables are processed using Reload because the system cannot determine which subjects or subject visits may have been impacted by the changes. The end result is the same, but processing may be slower.

  • Similarly, if the target is defined with Subject Visit UOW and a source table is defined as Subject UOW:

    • In Subject Visit UOW source tables, the transformation logic processes all records for impacted subject visits, as well as records for subjects impacted in the Subject UOW source tables.

    • In Subject UOW source tables, the transformation logic processes all records for impacted subjects.

Whenever possible, enable UOW processing by defining tables as UOW: when a table has subject and visit columns, designate those columns with the corresponding SDTM identifiers and set the processing type to UOW Subject Visit (or Subject if there is no visit column). Do this even if you do not plan to use UOW processing to write to the table; it enhances downstream processing performance.

If a source table is not a subject-oriented table, such as a lookup or reference table, and there is any change in data in it since the last execution of the program, the system uses Full Reload processing on all source tables.

Populating Surrogate Keys for Data Lineage Tracing

To enable the system to trace the lineage of every data point back to its source, all types and modes of data processing do the following the first time a transformation is run:

  • Create one additional column, called an SKEY column, in the target table for each source table; if two source tables feed into a target table, the transformation logic adds two SKEY columns to the target table.

  • Populate each SKEY column with the surrogate key value of the corresponding record in the source table. That value is the record's primary key values, concatenated and separated by pipes; for example:

    SUBJID_VALUE|VISINUM_VALUE|TEST_VALUE
    

See "How the System Tracks Data Lineage" for more information.
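A surrogate key of this form can be built by joining a record's primary key values with pipes; a minimal sketch (column names are illustrative):

```python
def surrogate_key(record, pk_columns):
    """Concatenate a record's primary key values, separated by pipes,
    e.g. SUBJID_VALUE|VISITNUM_VALUE|TEST_VALUE. Illustrative helper,
    not Oracle DMW code."""
    return "|".join(str(record[col]) for col in pk_columns)
```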

Data Processing in Validation Check Programs

The target tables of validation checks, which are created automatically by the system, use Reload Processing. The user can choose Full or Incremental mode when scheduling or submitting the validation check batch.

Loading Data

The system supports loading data from SAS or text files, or from InForm.

See also "Format Checks on Loaded Files".

You set up data loading when you define input clinical data models.

Processing Data Loads from Files

You can load text or SAS files (data sets, XPORT, or CPORT files) into an input data model. You set up a File Watcher for each input data model. The File Watcher then detects when a data file appears in a specified location and proceeds to load the data automatically. See "Configuring File Watcher" for more information.

The system supports three processing modes for loading data from files:

  • Full (Reload)

  • Incremental (Reload)

  • UOW Load

    Note:

    Do not use Full mode if the file being loaded contains only new data or a subset of data because any data not reloaded will be deleted.

    Do not use UOW Load mode if the file being loaded contains only new data or a subset of data for subjects or subject visits because any data not reloaded in processed units of work will be deleted. Instead, use Incremental (Reload) processing.

    Note:

    All deletions are "soft" deletions: records have an end timestamp equal to the load's date and time and are no longer available in the system. However, they still exist in the database and have an audit trail.

Table 9-2 Deletion Behavior in Data Loading Processing Modes

Processing Mode      Uses UOW Logic?   Deletion Behavior
UOW Load             Yes               Deletes nonreloaded records within units of work (subjects or subject visits) that have new, changed, or refreshed records.
Incremental Reload   No                Does not delete any data.
Full Reload          No                Deletes all records that are not reloaded.


Processing InForm Data Loads

Each InForm study has one InForm input clinical data model for each lifecycle stage. You set up a connection and a schedule, and data loading proceeds automatically. See "Configuring the InForm Connector" for more information.

Format Checks on Loaded Files

The system uses Oracle SQL*Loader to load SAS and text files in both Reload and Unit of Work modes, and makes use of its ability to test that incoming data meets the format requirements of target columns, including data type, length, code list values (if the column is associated with a code list), and nullable and check constraints.

Records that pass the format checks are inserted into the target table one record at a time using an Oracle INSERT statement. Records that are rejected are captured and checked for additional errors. All the errors found in each record are reported in a file called ComprehensiveErrRpt.csv.

For each record with an error, the error file contains one row with "ORIGINAL_ERROR" in the Column Name column, plus an additional row for each column value in error on the same record. Even though the record is rejected after the first error, the system continues to check all column values for the record and lists all errors in the error file.

Table 9-3, "Error File Example" shows the error file entries for two records. The first record has two errors and the second has one. The string "CDR_W37_1D0156B9.TXT_TV486627301.Name" represents the full path of the erroring field, Name, for Record 1. The first part, "CDR_W37_1D0156B9," is the schema name and the second, "TXT_TV486627301," is the view based on the target table.

Records with errors are not inserted into the target tables.

When you configure File Watcher, you can define the maximum number of rejected records allowed for the load before the load fails using the Max Errors parameter.

Table 9-3 Error File Example

TABLE_NAME   FILE_NAME   REC_NUM   COLUMN_NAME      VALUE                          ERROR_MESSAGE
DEMOG        DEMOG.txt   1         ORIGINAL_ERROR   Captured in the TextLoad.log   ORA-20100: ORA-01400: cannot insert NULL into ("CDR_W37_1D0156B9.TXT_TV486627301.NAME")
DEMOG        DEMOG.txt   1         NAME                                            ORA-01400: cannot insert NULL into ("CDR_W37_1D0156B9.TXT_TV486627301.NAME")
DEMOG        DEMOG.txt   1         HT                                              ORA-01400: cannot insert NULL into ("CDR_W37_1D0156B9.TXT_TV486627301.HT")
DEMOG        DEMOG.txt   32        ORIGINAL_ERROR   Captured in the TextLoad.log   ORA-20100: ORA-02290: check constraint ("CDR_W37_1D0156B9.TXT_TV486627301.AGE_CK") violated
DEMOG        DEMOG.txt   32        AGE              05                             ORA-02290: check constraint ("CDR_W37_1D0156B9.TXT_TV486627301.AGE_CK") violated
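Because the error file is a CSV with the columns shown in Table 9-3, it is straightforward to summarize with a small script. This sketch (an illustration, not a documented API) groups error rows by record, skipping the ORIGINAL_ERROR summary rows:

```python
import csv
from collections import defaultdict

def errors_per_record(lines):
    """Group error rows by (table, record number), skipping the
    ORIGINAL_ERROR summary rows. `lines` is an iterable of CSV lines
    with the Table 9-3 column layout; an illustrative sketch only.
    """
    errors = defaultdict(list)
    for row in csv.DictReader(lines):
        if row["COLUMN_NAME"] != "ORIGINAL_ERROR":
            errors[(row["TABLE_NAME"], row["REC_NUM"])].append(row["COLUMN_NAME"])
    return dict(errors)
```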


Supporting Duplicate Primary Keys in a Load

In rare cases you may need to allow a single load of source data to contain records with duplicate primary key values; for example, when loading data from a small lab that may not be able to guarantee uniqueness. In those cases you can define a composite key, but it may not be sufficient to ensure uniqueness.

If you need to support duplicate primary key values within a single data load, check Supports Duplicate when you define the primary key for the target table in the input data model. Selecting this option ensures that all records are loaded and not deleted but requires careful checking of the data.

When you select Supports Duplicate, the system adds the column CDR$DUP_NUM to the target table. During each data load, the system detects whether multiple incoming records have an identical primary key value and:

  • The system inserts a value of 1 into the CDR$DUP_NUM column for the first occurrence of each set of primary key values.

  • If another record with the same primary key values is loaded in the same data load, the system inserts a value of 2 into its CDR$DUP_NUM column, and 3 for the third record with the same primary key values, and so on.

  • During subsequent data loads, the system assumes that the first record with a particular set of primary key values is the same as the existing one with the CDR$DUP_NUM value 1, the second is the same as 2, and so on. If two records exist with values 1 and 2, and the next load contains three records with the same values, the system gives the third record a CDR$DUP_NUM value of 3.

    Note:

    For this system to work, the lab must always reload all records in the same order, adding new records to the end of the file.

The CDR$DUP_NUM value becomes part of the surrogate key as well as the primary key and is used for data lineage tracing; see "Populating Surrogate Keys for Data Lineage Tracing".
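The numbering rule described above can be sketched as follows; this is a simplified illustration of CDR$DUP_NUM assignment within one load, not Oracle DMW code:

```python
def assign_dup_nums(records, pk_columns):
    """Assign CDR$DUP_NUM within a single load: 1 for the first record
    with a given set of primary key values, 2 for the second, and so on.
    Illustrative sketch only.
    """
    seen = {}
    numbered = []
    for rec in records:
        key = tuple(rec[c] for c in pk_columns)
        seen[key] = seen.get(key, 0) + 1
        numbered.append({**rec, "CDR$DUP_NUM": seen[key]})
    return numbered
```

Because numbering depends on file order, this also shows why the source must reload records in the same order each time, appending new records at the end.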