Oracle® Clinical Development Analytics User and Administrator Guide, Release 2.0.0.2, Part Number E18162-03
This chapter contains the following topics:
To load data from the source systems to the data warehouse, OCDA uses Extract, Transform, and Load (ETL) programs that:
Identify and read desired data from different data source systems,
Clean and format data uniformly, and
Write it to the target data warehouse.
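These three steps can be sketched in miniature (an illustrative Python sketch; the row shapes, column names, and "ACTIVE" filter are hypothetical, not part of OCDA):

```python
# Minimal extract-transform-load sketch with hypothetical rows.
def extract(source_rows):
    # Identify and read the desired data from a source system.
    return [r for r in source_rows if r["status"] == "ACTIVE"]

def transform(rows):
    # Clean and format the data uniformly.
    return [{**r, "name": r["name"].strip().upper()} for r in rows]

def load(rows, warehouse):
    # Write the conformed rows to the target warehouse table.
    warehouse.extend(rows)

warehouse = []
source = [
    {"name": " study01 ", "status": "ACTIVE"},
    {"name": "study02", "status": "CLOSED"},
]
load(transform(extract(source)), warehouse)
print(warehouse)  # [{'name': 'STUDY01', 'status': 'ACTIVE'}]
```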
In OCDA, Oracle Clinical and Oracle's Siebel Clinical are the source systems for which Oracle provides predefined ETL.
Figure 5-1 displays the ETL process delivered with OCDA.
OCDA uses Oracle Life Sciences Data Hub (Oracle LSH) to maintain star-schema tables that enable user reporting. Set up as a recurring job, the Oracle LSH Extraction, Transformation, and Load process (ETL) is designed to periodically capture targeted metrics (dimension and fact data) from multiple clinical trial databases, transform and organize them for efficient query, and populate the Oracle LSH star-schema tables.
While the OCDA data model supports data extraction from multiple sources, OCDA only includes source-dependent extract (SDE) mappings for the Oracle Clinical and Siebel Clinical databases. However, you can also define SDE mappings from additional external sources that write to the appropriate staging tables. Note that you are responsible for resolving any duplicate records that may be created as a consequence. For more information about how to add a new data source to OCDA, refer to Adding a New Data Source.
Oracle LSH uses pass-through views to access transactional data from source databases. The SDE programs map the transactional data to source-specific staging tables, in which the data must conform to a standardized format, effectively merging the data from multiple, disparate database sources. This is the architectural feature that accommodates external database sourcing.
A pooling program reads the data from all the source-specific staging tables and loads it into the common staging table.
Oracle LSH then transforms the staged data in the common staging table using source-independent loads (SILs) into internal star-schema tables, where the data is organized for efficient query by the Oracle BI Server.
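Conceptually, the pooling step is a union of the source-specific staging tables into the common staging table, as in this sketch (row shapes are hypothetical; the DATASOURCE_NUM_ID values 1 and 2 follow the Oracle Clinical and Siebel Clinical convention noted later in this chapter):

```python
# Pooling sketch: rows from each source-specific staging table are
# merged into a single common staging table, tagged by source.
oc_staging = [{"study_id": "ST-01", "datasource_num_id": 1}]  # Oracle Clinical
sc_staging = [{"study_id": "ST-99", "datasource_num_id": 2}]  # Siebel Clinical

common_staging = []
for staging_table in (oc_staging, sc_staging):
    common_staging.extend(staging_table)

print(len(common_staging))  # 2
```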
There is one SDE mapping for each target table, which extracts data from the source system and loads it into the respective source-specific staging table. SDEs have the following features:
Incremental submission mode: OCDA-supplied ETL uses timestamps and journal tables in the source transactional system to optimize periodic loads.
Bulk and normal load: Bulk load uses block transfers to expedite loading of large data volumes. It is intended for use during initial data warehouse population, and is faster when the data volume is sufficiently large. However, if a bulk load is interrupted (for example, by exhausted disk space or a power failure), it cannot be resumed from the point of failure; you must restart the load from the beginning.
Normal load writes one record at a time. It is intended for updates to the data warehouse after the initial population is complete, and is faster when the data volume is sufficiently small. A normal load can be restarted if it is interrupted.
You must set the appropriate target load type for an ETL program in Oracle LSH to indicate bulk or normal load. In OCDA, bulk load is enabled by default for all SDEs.
See Also:
Setting Up the Target Load Type for information about setting the appropriate table processing type.
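The incremental submission mode described above can be sketched as follows (an illustrative sketch; the row shape and modification timestamp are hypothetical stand-ins for the source system's timestamps and journal tables):

```python
from datetime import datetime

# Incremental-extraction sketch: only rows modified after the last
# successful extraction are picked up on the next run.
last_extraction = datetime(2010, 7, 26)
source_rows = [
    {"id": 1, "modified": datetime(2010, 7, 25)},  # already loaded
    {"id": 2, "modified": datetime(2010, 7, 27)},  # new since last run
]
delta = [r for r in source_rows if r["modified"] > last_extraction]
print([r["id"] for r in delta])  # [2]
```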
There is one SIL mapping for each target table. The SIL extracts the normalized data from the common staging table and inserts it into the data warehouse star-schema target table. SILs have the following attributes:
Concerning changes to dimension values over time, OCDA overwrites old values with new ones. This strategy is known as the Type 1 Slowly Changing Dimension (SCD) approach.
OCDA's data model includes aggregate tables and a number of indexes, designed to minimize query time.
By default, bulk load is disabled for all SILs.
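The Slowly Changing Dimension behavior mentioned above (overwriting old dimension values rather than keeping history) can be sketched as:

```python
# Type 1 SCD sketch: a changed attribute value overwrites the old
# one in place; no history of the previous value is retained.
# The dimension key and "phase" attribute are hypothetical.
study_dim = {"ST-01": {"phase": "Phase II"}}

def scd_type1_update(dim, key, new_attrs):
    dim.setdefault(key, {}).update(new_attrs)  # overwrite, no history

scd_type1_update(study_dim, "ST-01", {"phase": "Phase III"})
print(study_dim["ST-01"]["phase"])  # Phase III
```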
The results of each ETL execution are logged. The logs hold information about any errors encountered during execution.
Informatica provides the following four error tables:
PMERR_DATA
PMERR_MSG
PMERR_SESS
PMERR_TRANS
During ETL execution, records that fail to be inserted into the target table (for example, because they violate a constraint) are placed in the Informatica PowerCenter error tables. You can review which records did not make it into the data warehouse and decide on the appropriate action for them.
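The error-capture behavior can be sketched as follows (the constraint and row shapes are hypothetical; in practice the rejected rows land in the PMERR_* tables listed above):

```python
# Error-capture sketch: rows that violate a constraint are diverted
# to an error table instead of the warehouse target, so they can be
# reviewed later. The NOT NULL rule here is a hypothetical example.
def load_with_error_capture(rows, target, error_table):
    for row in rows:
        if row["study_id"] is None:      # simulated NOT NULL constraint
            error_table.append(row)      # rejected row is captured
        else:
            target.append(row)

target, errors = [], []
load_with_error_capture(
    [{"study_id": "ST-01"}, {"study_id": None}], target, errors)
print(len(target), len(errors))  # 1 1
```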
Adding Data Source Information
As you read data from different database instances, you need to specify the source of the data. OCDA provides the W_RXI_DATASOURCE_S table (in the RXI schema), which stores information about all data sources from which data is extracted for OCDA. The following are some of the columns in this table:
ROW_WID - A unique ID for each record in the table.
DATASOURCE_NUM_ID - The ID for the database. Must be coordinated with the value given to the database when ETL is run.
DATASOURCE_NAME - A meaningful name of the database.
DATASOURCE_TYPE - Application system that manages the database.
DESC_TEXT - Optional text describing the purpose of the database.
INTEGRATION_ID - Set this to the same value as DATASOURCE_NUM_ID.
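An illustrative registration entry for a custom source might look like this (all values are hypothetical; as described in Adding a New Data Source, custom sources should use a DATASOURCE_NUM_ID greater than 100):

```python
# Hypothetical registration row for a custom data source. The name
# and type values are invented for illustration.
new_source = {
    "ROW_WID": 101,
    "DATASOURCE_NUM_ID": 101,
    "DATASOURCE_NAME": "Acme EDC Production",            # hypothetical
    "DATASOURCE_TYPE": "Acme EDC",                       # hypothetical
    "DESC_TEXT": "Custom EDC system for legacy studies",
    "INTEGRATION_ID": 101,  # same value as DATASOURCE_NUM_ID
}
assert new_source["INTEGRATION_ID"] == new_source["DATASOURCE_NUM_ID"]
assert new_source["DATASOURCE_NUM_ID"] > 100
```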
See Also:
Oracle Clinical Development Analytics Electronic Technical Reference Manual, for more information about the W_RXI_DATASOURCE_S table.
Adding a New Data Source, for more information about how to add a new data source to OCDA.
Handling Deletions in Siebel Clinical
OCDA provides an optional feature to manage hard deletion of records in Siebel Clinical. You create triggers in the source system to handle deletion of records. To do this:
Navigate to the temporary staging location where the OCDA installer copies the installation files.
Connect to the Siebel Clinical data source and run the ocda_sc_del_trigger.sql script delivered with OCDA. This script creates the RXI_DELETE_LOG_S table and triggers on the tables provided as input. The following are the tables in Siebel Clinical for which OCDA supports creating triggers:
S_CL_PTCL_LS
S_PROD_INT
S_CL_SUBJ_LS
S_CONTACT
S_CL_PGM_LS
S_PTCL_SITE_LS
S_EVT_ACT
Provide a comma-separated list of the table names for which the triggers need to be created as the script's input. For example, S_CL_PTCL_LS,S_PROD_INT,S_CL_SUBJ_LS. The table names that you provide must be a subset of the tables listed above.
Note that when a user deletes a record in one of these tables, the primary key of the deleted record is inserted into the RXI_DELETE_LOG_S table on the Siebel Clinical source system.
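The effect of the trigger can be sketched as follows (a simplified simulation; the actual trigger is created by ocda_sc_del_trigger.sql, and the log table's real column names may differ):

```python
# Simplified simulation of the delete trigger: removing a row from a
# tracked source table records its primary key in RXI_DELETE_LOG_S,
# so the warehouse can later soft-delete the matching record.
rxi_delete_log_s = []

def delete_row(table, row_id, table_name):
    del table[row_id]
    rxi_delete_log_s.append({"table_name": table_name, "pk": row_id})

s_cl_subj_ls = {"SUBJ-1": {"status": "Enrolled"}}
delete_row(s_cl_subj_ls, "SUBJ-1", "S_CL_SUBJ_LS")
print(rxi_delete_log_s)  # [{'table_name': 'S_CL_SUBJ_LS', 'pk': 'SUBJ-1'}]
```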
Update the remote location of the OCDA_RXI_DELETE_LS load set in OCDA_DELETE_LOG_WA to the Siebel Clinical source database connection, and install the work area. To do this:
Navigate to the OCDA_SOURCES_APP_AREA.
Click the OCDA_DELETE_LOG_WA work area.
Click the OCDA_RXI_DELETE_LS load set.
Click Check Out.
Click Apply.
In the Load Set Attributes section, click Update.
Click the Search icon.
Select OCDA_SC_OLTP_RL/<Connection_Name>.
Click Apply.
Reinstall the work area containing the load set and passthrough views.
For more information, refer to Oracle Clinical Development Analytics Installation Guide (Post Installation Tasks).
Modify the value of the DELETE_FLOW submission parameter for the following dimension programs based on the triggers created in step 2:
OCDA_INFA_Study_Dim_SDE_SC_PRG
OCDA_INFA_Study_Site_Dim_SDE_SC_PRG
OCDA_INFA_Study_Region_Dim_SDE_SC_PRG
OCDA_INFA_Program_Dim_SDE_SC_PRG
OCDA_INFA_Product_Dim_SDE_SC_PRG
OCDA_INFA_Study_Subject_Dim_SDE_SC_PRG
OCDA_INFA_Party_Per_Dim_SDE_SC_PRG
OCDA_INFA_SS_Con_Dim_SDE_SC_PRG
Perform the following steps:
Navigate OCDA_domain > OCDA_CODE_APP_AREA > OCDA_SDE_SC_WORK_AREA.
Click the Name hyperlink of the program.
Click Submit.
Enter the following information in Submission Details:
Submission Type: Backchain
Force Execution: Yes
In Submission Parameters, enter the value of DELETE_FLOW as Y. The default value is N, which indicates that OCDA does not handle deletion in Siebel Clinical.
Click Submit.
Execute the ETLs as listed in the Executing the ETL Programs section.
The Siebel Clinical-related SDE mappings read this instance of the RXI_DELETE_LOG_S table.
Note:
Records that are deleted in the source system are soft deleted in the data warehouse.
See Also:
Oracle Life Sciences Data Hub User's Guide, (Tracking Job Execution), for more information about viewing job execution logs.
OCDA provides predefined ETL for Oracle Clinical and Siebel Clinical. To add a new data source, perform the following tasks:
See Also:
Oracle Life Sciences Data Hub Application Developer's Guide
Create a new Work Area to add the new load set for the source tables.
Create a new remote location.
Create a new load set in the new Work Area you created in step 1.
Assign the remote location to the load set created in step 3.
Import the metadata (source).
Create a new staging area.
Tip:
You can copy and rename the existing stage table definitions (W_RXI_OC_STUDY_DS or W_RXI_SC_STUDY_DS).
Create a new Work Area to add the new ETL program.
Create new SDE programs to load the tables from the source system into the staging area. For more information about creating a new SDE program, refer to Creating an ETL Program.
Customize the OCDA_POOL_WORK_AREA pooling program to include data from the new staging area. For more information about customizing the pooling program, refer to Customizing the Pooling Program.
Insert data into the W_RXI_DATASOURCE_S table and assign the source a DATASOURCE_NUM_ID. Set this value to a number greater than 100.
Important:
Ensure the following:
You pass DATASOURCE_NUM_ID and TENANT_ID (default value is 0) as parameters to the SDE.
The DATASOURCE_NUM_ID column is populated in all your tables using the SDE.
Figure 5-2 displays the OCDA domain hierarchy in Oracle LSH:
Figure 5-2 OCDA Domain Hierarchy in Oracle LSH
To load data from the staging tables to their respective target tables in the data warehouse, execute the SIL programs packaged with OCDA. Perform the following tasks in Oracle LSH:
Important:
Ensure the following:
The Informatica Distributed Processing (DP) Server is up and running.
You execute the base fact ETL programs before you execute the aggregate fact ETL programs.
Navigate to OCDA_domain > OCDA_SOURCES_APP_AREA > OCDA_DWH_PASS_THROUGH_WA.
Note:
Ensure that you have installed the domain, Application Area, and Work Area in Oracle LSH before you perform the subsequent steps. For more information, refer to Oracle Clinical Development Analytics Installation Guide (Post Installation Tasks).
If you are executing the ETL programs for the first time after installing OCDA, submit the OCDA_PLS_S_DUAL_D_PRG program before submitting any fact in backchain.
Navigate to OCDA_domain > OCDA_CODE_APP_AREA > OCDA_WORK_AREA.
Note:
Ensure that you have installed the domain, Application Area, and Work Area in Oracle LSH before you perform the subsequent steps. For more information, refer to Oracle Clinical Development Analytics Installation Guide (Post Installation Tasks).
If you are executing the ETL programs for the first time after installing OCDA, submit the following programs in the given order before submitting any fact in backchain:
OCDA_PLS_DUAL_PRG
OCDA_INFA_DayDimension_SIL_PRG
OCDA_INFA_MonthDimension_SIL_PRG
Important:
Ensure that you submit the above programs every time you modify them.
Set the config_days submission parameter in the OCDA_CONTROL_TABLE_POPULATE_PRG program. To do this:
Navigate to OCDA_domain > OCDA_SOURCES_APP_AREA > OCDA_CONTROL_TABLE_WA.
Click the OCDA_CONTROL_TABLE_POPULATE_PRG hyperlink.
Click Submit.
Enter the value of the config_days submission parameter.
Note:
The control table stores the time range for which the records are extracted from the database. When a base fact SIL program is submitted or triggered in backchain, the data in the control table is populated first. This table stores the following information:
DATASOURCE_NUM_ID
ETL_RUN_ID
MASTER_JOB_ID
SOURCE_EXTRACTION_START_DT
SOURCE_EXTRACTION_END_DT
PROGRAM_NAME
PROGRAM_RUN_DT
The config_days parameter is used to determine the extraction end date for the incremental load. By default, the value of this parameter is 1, which indicates that the source extraction end date is a day less than the ETL program run date. For example, if the ETL program run date is 28 July 2010, the source extraction end date is 27 July 2010.
The value of the config_days parameter can also be a fraction or a negative value. For example, if the value of this parameter is .5, the source extraction end date is 12 hours less than the system date in the target database.
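The end-date arithmetic can be sketched as follows (an illustrative sketch of the rule described above, not the actual program logic):

```python
from datetime import datetime, timedelta

# Sketch of the config_days arithmetic: the source extraction end
# date is the ETL run date minus config_days days. Fractional and
# negative values behave the same way.
def extraction_end_date(run_date, config_days=1):
    return run_date - timedelta(days=config_days)

run = datetime(2010, 7, 28)
print(extraction_end_date(run, 1))    # 2010-07-27 00:00:00
print(extraction_end_date(run, 0.5))  # 2010-07-27 12:00:00
```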
In Submission Type, select Backchain.
In Force Execution, select Yes.
Click Submit.
Navigate to OCDA_domain > OCDA_UTIL_APP_AREA > OCDA_ETL_WORKFLOW_WA.
If Oracle Clinical is your only data source, remove the following Table Descriptors from OCDA_PLS_LEVEL1_FACT_PRG:
W_ACTIVITY_F
W_RXI_RGN_ENRLMNT_PLN_F
If Siebel Clinical is your only data source, remove the following Table Descriptors from OCDA_PLS_LEVEL1_FACT_PRG:
W_RXI_DISCREPANCY_F
W_RXI_DISCRPNCY_STATUS_F
W_RXI_RECEIVED_CRF_F
Install the OCDA_PLS_LEVEL1_FACT_PRG program.
If Siebel Clinical is your only data source, remove the following Table Descriptors from OCDA_PLS_LEVEL1_AGG_FACT_PRG:
W_RXI_DISCREPANCY_A
W_RXI_DISCRPNCY_STATUS_A
W_RXI_RECEIVED_CRF_A
Install the OCDA_PLS_LEVEL1_AGG_FACT_PRG program.
Submit the OCDA_PLS_ETL_WORKFLOW_PRG program.
Note:
Ensure that the value of the config_days submission parameter is the same as in Step 5 of this section.
Note:
Execution of the ETL (specifically the OCDA_ETL_RUN_S_POP_PRG program) populates W_ETL_RUN_S.LOAD_DT with the timestamp for the execution of the ETL. This ETL execution timestamp is used in the calculation of OCDA measures concerning the amount of time that currently open discrepancies have been open.
While the timestamp is captured in CURRENT_ETL_LOAD_DT, it is only available for calculation of discrepancy intervals through the OBIEE Dynamic Repository Variable CURRENT_DAY. CURRENT_DAY is refreshed from LOAD_DT at a fixed interval, by default 5 minutes, starting each time the Oracle BI Service is started. Between the time that the ETL is run, and the time that CURRENT_DAY is refreshed, calculations of intervals that currently open discrepancies have been open will be inaccurate.
There are two remedies: (i) restart the Oracle BI Server after every execution of the ETL, which causes CURRENT_DAY to be refreshed to the correct value; or (ii) if this is inconvenient, modify the interval between refreshes of the value of CURRENT_DAY. For more information on how to modify the refresh interval for CURRENT_DAY, refer to Maintaining the Oracle Clinical Development Analytics Repository.
Tip:
You can schedule the jobs to execute at regular intervals. For more information on scheduling jobs, refer to Scheduling an ETL Program.
The following rules apply when you customize an ETL program:
You can customize ETL programs in different ways. You can create a new domain, a new Application Area within the same domain, or a new Work Area within the same Application Area. Creating a new domain and storing customized definitions in it ensures that the definitions are not overwritten on the next OCDA upgrade.
Oracle recommends that you set up a single domain for all customization to OCDA. This will ensure that all the customized object definitions are available in the same top-level container.
After you create your own Work Area, clone the Oracle-supplied Work Area into your own Work Area. This creates copies of the object instances inside your Work Area, but they continue to point to the object definitions inside the Oracle-supplied Application Area.
Caution:
To correctly track the timing of OCDA ETL execution, Oracle recommends that no Program in the OCDA Domain, other than the SIL provided with OCDA, read from any OCDA staging table. The staging tables, at any given time, hold only transient records that were loaded during the most recent ETL execution. Ideally, you should not read from them. If it is necessary to read from an OCDA staging table, Oracle recommends that you define the Program in a Domain other than the OCDA Domain containing the staging table, and execute it in that separate Domain.
Though OCDA includes ETL programs for extracting data from Oracle Clinical and Siebel Clinical to the OCDA data warehouse, you may want to create your own ETL to extract data from other data sources.
Note:
The value of DATASOURCE_NUM_ID is set to 1 for Oracle Clinical and 2 for Siebel Clinical. If you want to add your own data sources, set this value to a number greater than 100.
See Also:
Oracle Life Sciences Data Hub Application Developer's Guide (Defining Programs)
Informatica PowerCenter Online Help
To add one or more tables or columns, along with the associated ETL programs that populate data into these tables, perform the following tasks:
Create the new source and target table metadata inside your Work Area.
If the tables exist outside Oracle LSH in some remote schema, flat file, or SAS file, you can upload the table structure into Oracle LSH using an Oracle LSH Load Set.
Create a Program in Oracle LSH and specify that the Program is an Informatica-type Program.
Add these tables as sources or targets.
Install the new Oracle LSH Program to ensure that the new tables are created in the Work Area schema.
Check out your new Oracle LSH program.
Important:
Do not use the Copy definition to the local Application Area and check out option when checking out the program.
In the Program's screen, click Launch IDE.
This launches the Informatica PowerCenter client installed on your machine.
Work in Informatica PowerCenter and create the ETL components (transformations or workflows) used by this Oracle LSH Program.
Go back to Oracle LSH and upload the ETL file from Informatica PowerCenter to Oracle LSH.
Install and run the Program in Oracle LSH.
Important:
Before you reinstall, ensure that the Informatica DP Server is up and running.
Tip:
If the target table you added relies on new source tables, and you want the source tables to be automatically populated when you trigger the ETL program for the final target table, enable backchaining for the Oracle LSH Program that populates the source tables. If there are no new source tables in your customization, Oracle LSH automatically triggers the population of the source tables when the ETL program that populates the final target table is executed. Note that backchain submissions are not cloned to the Work Area; you have to create them manually.
For more information on backchaining, refer to Oracle Life Sciences Data Hub Application Developer's Guide (Execution and Data Handling).
You may also want to modify an existing ETL to meet your reporting requirements.
See Also:
Oracle Life Sciences Data Hub Application Developer's Guide (Defining Programs)
Informatica PowerCenter Online Help
To modify an ETL without any changes to the associated tables or columns, perform the following tasks:
Install your Work Area and run the ETL to ensure that you can see data populated in the target data warehouse tables.
Identify the Oracle LSH program that contains the metadata for the ETL that needs to be modified.
Check out the Oracle LSH program that contains the metadata for that ETL.
Important:
Use the Copy definition to the local Application Area and check out option to check out the program. This ensures that you do not modify the definitions inside the domain shipped with OCDA.
In the Program's screen, click Launch IDE.
This launches the Informatica PowerCenter client installed on your machine.
Modify the ETLs (transformation and/or workflow) used by the Oracle LSH Program.
Test and upload the ETL from Informatica PowerCenter to Oracle LSH.
Install the program in Oracle LSH, and run it to verify the changes.
Note:
The ETL programs that extract data for the warehouse fact tables assume that the dimensions to which each fact is related are up-to-date at the time the fact ETL programs are executed. This assumption is the basis for certain fact calculations that would provide erroneous results if the assumption were not true. For example, in the received CRFs fact, the value of the pCRF entry complete measure depends on whether or not the study requires second pass entry. But that piece of information (second pass entry required) is obtained from an attribute of the Study dimension. So, if the second-pass requirement for a study changes, and the change is not applied to the Study dimension, the Received CRF fact attributes will contain incorrect values.
As shipped, OCDA ETL workflows ensure this interlock by executing the ETL for related dimensions immediately before running the ETL for a fact. This is standard warehouse management practice, but it is especially important given the interdependence of the dimensions and the fact. The need to execute dimension ETL immediately before the corresponding fact ETL, and the danger of not doing so, is emphasized here because it is possible (though discouraged) to modify these shipped workflows.
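The dependency described in this note can be sketched as follows (the entry-pass rule shown is a hypothetical simplification of the pCRF entry complete measure):

```python
# Sketch of the fact-dimension dependency: a fact measure computed
# from a stale dimension attribute is wrong until the dimension is
# refreshed. The rule and attribute names are hypothetical.
study_dim = {"ST-01": {"second_pass_required": True}}  # stale value

def pcrf_entry_complete(study_id, passes_done):
    # Entry is complete when the required number of entry passes
    # (1 or 2, per the Study dimension) has been done.
    required = 2 if study_dim[study_id]["second_pass_required"] else 1
    return passes_done >= required

print(pcrf_entry_complete("ST-01", 1))  # False (dimension is stale)
study_dim["ST-01"]["second_pass_required"] = False     # dimension refreshed
print(pcrf_entry_complete("ST-01", 1))  # True
```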
To modify one or more tables or columns without any changes to the associated ETL programs:
Install your Work Area and run the ETL to ensure that you can see data populated in the target data warehouse tables.
Check out the Oracle LSH program that contains the metadata for that ETL.
Important:
Use the Copy definition to the local Application Area and check out option when checking out the program. This ensures that you do not modify the definitions inside the domain shipped with OCDA.
Change the table properties.
To change the underlying columns and variables, check out the variables that the columns point to. Ensure that you use the Copy definition to the local Application Area and check out option when checking out the variable.
Note:
If the changes to the tables or columns are not compatible with the table that is installed in the data warehouse schema, you will get a warning while making the change. For example, if you are reducing the length of a number column from 15 to 10, the change is not compatible with the data existing in the table. Such changes will not let you perform an Upgrade install on the table. You will have to drop and create the table using a Partial or Full install.
Install the changed table or column, and run the ETL program that populates it.
To modify the pooling program, perform the following tasks:
Identify the Oracle LSH program that contains the metadata for the ETL that needs to be modified.
Check out the Oracle LSH program that contains the metadata for that ETL.
Important:
Use the Copy definition to the local Application Area and check out option to check out the program. This ensures that you do not modify the definitions inside the domain shipped with OCDA.
In the Program's screen, click Launch IDE.
This launches the Informatica PowerCenter client installed on your machine.
Modify the pooling program. Add the newly created stage table (in step 6) as the source.
Test and upload the ETL from Informatica PowerCenter to Oracle LSH.
Install the program in Oracle LSH, and run it to verify the changes.
When you submit a Program for execution in Oracle LSH, you can schedule it to execute at regular intervals. To schedule a Program, perform the following tasks:
In the appropriate Work Area, navigate to the installed executable instance you want to submit and click Submit.
The Submit Execution Setup screen is displayed.
For more information on how to submit an Execution Setup, refer to Oracle Life Sciences Data Hub Application Developer's Guide (Submitting Jobs for Execution).
In the Submission Details section, select Submission Type as Scheduled.
The Schedule Submission section is displayed.
Enter the required details and click Submit.
When you submit a Program for execution, perform the following tasks to specify the table processing type:
In the appropriate Work Area, navigate to the installed executable instance you want to submit.
In the Program's screen, click Launch IDE.
This launches the Informatica PowerCenter client installed on your machine.
In the Workflow Manager, modify the Target load type setting to Bulk or Normal.
Reinstall the program in Oracle LSH.
Navigate to the installed executable instance you want to submit, and click Submit.
The Submit Execution Setup screen is displayed.
For more information on how to submit an Execution Setup, refer to Oracle Life Sciences Data Hub Application Developer's Guide (Submitting Jobs for Execution).
In the Submission Parameters section, select the Parameter Value for Bulk Load based on what you have set up in Step 3. Select Yes if the table processing type is bulk, and No if it is normal.
Enter the required details and click Submit.