Oracle® Health Sciences Clinical Development Analytics Administrator's Guide
Release 2.2 for Standard Configuration

E25023-03

2 Extract Transform Load Programs

This chapter contains the following topics:

  • Section 2.1, "ETL Architecture"

  • Section 2.2, "Executing the ETL Execution Plans"

  • Section 2.3, "Customizing an ETL Execution Plan"

  • Section 2.4, "Creating an ETL Execution Plan"

  • Section 2.5, "Modifying an ETL Execution Plan"

  • Section 2.6, "Scheduling an ETL Execution Plan"

To load data from the source systems to the data warehouse, OHSCDA uses Extract Transform and Load (ETL) programs that extract data from each source database, transform it into the common warehouse format, and load it into the warehouse tables.

In OHSCDA, Oracle Clinical and Oracle's Siebel Clinical are the source systems for which Oracle provides predefined ETL.

2.1 ETL Architecture

ETL is the process by which data is copied from a source database and placed in warehouse tables, available for analytic queries.

Figure 2-1 ETL Architecture


OHSCDA supports the extraction of data from one or more transactional application source databases. In OHSCDA, ETL is performed by the execution of Informatica mappings. A mapping is a program that selects data from one or more tables, performs transformations and derivations on the data, and inserts or updates the results in a target table. Because data can be extracted from databases managed by different applications, the ETL that loads any given warehouse table is composed of two parts. The first part is a mapping that reads from a specific application's database and writes to a common warehouse staging table. This mapping is referred to as an SDE, for Source-Dependent Extract. Records in the staging table have a common format, regardless of the source from which they were extracted. For a given warehouse table, there must be one SDE for each supported source application. The second mapping for the given warehouse table reads from its staging table and writes to the final warehouse table. This mapping is called the SIL, for Source-Independent Load.

Figure 2-2 shows the ETL process for extracting data from two tables (one dimension D1 and one fact F1) in one source application database.

Figure 2-2 ETL for One Source Database


Each numbered arrow represents the execution of a Mapping. The Mappings are executed in the order indicated by the numbers. Mapping 4, the SIL that loads fact F1 into the warehouse target table for the fact, has two components. The first part (4a) represents the copy, transformation, and derivation of the staged data for the fact. The second part (4b) represents the setting of the values of each record's foreign key to dimension D1. The result is that dimension D1 can be queried as one of the dimensions in the star-schema centered on fact F1.
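
The setting of foreign keys in step 4b is essentially a key lookup against the dimension: each staged fact row carries the source key of its related dimension record, and the SIL replaces it with the dimension's warehouse key. The following SQL is a minimal sketch of that lookup, not the actual OHSCDA mapping logic; the table and column names (W_F1_F, W_D1_D, ROW_WID, INTEGRATION_ID) are illustrative warehouse conventions.

    -- Illustrative only: object names are hypothetical. Resolve the fact's
    -- foreign key to dimension D1 by matching the source key carried
    -- through staging to the dimension's warehouse surrogate key.
    UPDATE w_f1_f f
    SET    f.d1_wid = (SELECT d.row_wid
                       FROM   w_d1_d d
                       WHERE  d.integration_id    = f.d1_integration_id
                       AND    d.datasource_num_id = f.datasource_num_id)
    WHERE  f.d1_wid IS NULL;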

OHSCDA provides a SIL for each of its warehouse tables. It provides SDEs to stage data for these tables from the Oracle Clinical and Siebel Clinical databases. Execution of the mappings is controlled by the Oracle Data Warehouse Administration Console (DAC).

If data is loaded from more than one data source, it is necessary to integrate the data by identifying and merging duplicate source records. To address this, OHSCDA provides a multi-source integration capability. Figure 2-3 illustrates how OHSCDA's ETL architecture expands to support multi-source integration. Multi-source integration is described in detail in Chapter 3, "Multi-Source Integration".

Figure 2-3 OHSCDA ETL in Multi-source Integration


The OHSCDA ETL architecture is flexible. It allows you to integrate data from multiple database instances for one application (for example, multiple Oracle Clinical databases), or databases from different source applications (for example, one Oracle Clinical and one Siebel Clinical), or any combination of these.

The OHSCDA ETL architecture is extensible. OHSCDA provides SDE mappings for Oracle Clinical and Siebel Clinical. You can add data from another application to the OHSCDA warehouse by creating the SDE to map its database tables to the OHSCDA Staging tables.

For each source application there is one SDE mapping for each warehouse table. The SDE extracts data from the source system and loads it into the corresponding staging table.

There is one SIL mapping for each warehouse target table. The SIL extracts the normalized data from the staging table and inserts it into the data warehouse star-schema target table.

Adding Data Source Information

As you read data from different database instances, you need to identify the source of each record. OHSCDA provides the W_RXI_DATASOURCE_S table (in the RXI schema), which stores information about every data source from which data is extracted for OHSCDA; each source is identified by a unique DATASOURCE_NUM_ID.
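
For example, you can review the sources that are currently registered with a query such as the following. This is a minimal sketch: DATASOURCE_NUM_ID is the documented source identifier, while the name column shown is an assumption about the table's layout.

    -- List the registered data sources; DATASOURCE_NUM_ID uniquely
    -- identifies each source. Other column names shown are assumed.
    SELECT datasource_num_id, datasource_name
    FROM   rxi.w_rxi_datasource_s
    ORDER BY datasource_num_id;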


Handling Deletions in Siebel Clinical

OHSCDA provides an optional feature to manage hard deletion of records in Siebel Clinical. You create triggers in the source system to handle deletion of records. To do this:

  1. Navigate to the temporary staging location where the OHSCDA installer copies the installation files.

  2. Connect to the Siebel Clinical data source and run the ocda_sc_del_trigger.sql script delivered with OHSCDA. This script creates the RXI_DELETE_LOG_S table and creates triggers on the tables provided as input; a sketch of such a trigger appears after this procedure. The following are the tables in Siebel Clinical for which OHSCDA supports creating triggers:

    • S_CL_PTCL_LS

    • S_PROD_INT

    • S_CL_SUBJ_LS

    • S_CONTACT

    • S_CL_PGM_LS

    • S_PTCL_SITE_LS

    • S_EVT_ACT

    • S_ORG_EXT

    As the script's input, provide a comma-separated list of the table names for which triggers need to be created, for example: S_CL_PTCL_LS, S_PROD_INT, S_CL_SUBJ_LS. The table names that you provide must be a subset of the tables listed above.

    When a user deletes a record in one of these tables, the trigger inserts the primary key of the deleted record into the RXI_DELETE_LOG_S table on the Siebel source system.

  3. In DAC, set the value of the DELETE_FLOW submission parameter to Y on the Source System Parameters tab, under the CDA_Warehouse container within the Design view.

  4. Execute the ETLs as described in Section 2.2, "Executing the ETL Execution Plans".

    The Siebel Clinical SDE mappings read this RXI_DELETE_LOG_S table to identify the deleted records.

Note:

Records that are deleted in the source system are soft deleted in the data warehouse.
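
The delivered ocda_sc_del_trigger.sql script creates these triggers for you. The following sketch only illustrates the general shape of such a trigger for one supported table; the RXI_DELETE_LOG_S column names used here are assumptions, not the script's actual definitions.

    -- Illustrative sketch only: the actual trigger bodies and the actual
    -- RXI_DELETE_LOG_S columns are created by ocda_sc_del_trigger.sql and
    -- may differ from the names assumed here.
    CREATE OR REPLACE TRIGGER s_cl_subj_ls_del_trg
    AFTER DELETE ON s_cl_subj_ls
    FOR EACH ROW
    BEGIN
      INSERT INTO rxi_delete_log_s (table_name, row_id, delete_dt)
      VALUES ('S_CL_SUBJ_LS', :OLD.row_id, SYSDATE);
    END;
    /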

2.1.1 Adding a New Data Source

OHSCDA provides predefined source-dependent extract (SDE) mappings for Oracle Clinical and Siebel Clinical. Enter your source system information in the W_RXI_DATASOURCE_S table, whose structure is described in Section 2.1, "Adding Data Source Information". If you want to add another database (whether for one of these applications or for another application), perform the following tasks:

  1. Create new SDE programs to load the tables from the source system into the staging area. For more information about creating a new SDE program, see Section 2.4, "Creating an ETL Execution Plan".

  2. Insert a record into the W_RXI_DATASOURCE_S table, assigning the source a unique DATASOURCE_NUM_ID. Set this value to a number greater than 100 (see the example after these steps).

    Important:

    • When calling an SDE mapping to read from a particular database instance, ensure that you pass in the value of DATASOURCE_NUM_ID that corresponds to that database. Also, pass ENTERPRISE_ID (normally 0) to the SDE.

    • If you write a new SDE, ensure that it sets the value of DATASOURCE_NUM_ID in the staging table to which it writes.

  3. In DAC, navigate to the Setup view, and then select the Physical Data Sources tab.

  4. Enter values in the following fields:

    • Name: Logical name for the database connection.

    • Type: Select Source for a connection to a source database.

    • Connection Type: Type of database.

    • Dependency Priority: Number used to generate dependencies when designing execution plans.

    • Data Source Number: Unique number assigned to the data source category so that the data can be identified in the data warehouse. Enter the same value that you entered in the W_RXI_DATASOURCE_S table.
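
The following is a minimal sketch of the insert described in step 2, registering a hypothetical new source with DATASOURCE_NUM_ID 101. Only DATASOURCE_NUM_ID is a documented column; the other column and the values shown are illustrative.

    -- Register a hypothetical new source. DATASOURCE_NUM_ID must be unique
    -- and greater than 100; the name column shown here is assumed.
    INSERT INTO rxi.w_rxi_datasource_s (datasource_num_id, datasource_name)
    VALUES (101, 'My Custom EDC System');
    COMMIT;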

If deduplication is required for this source system, then perform the following steps:

  1. Create an extraction program for full load that generates a flat file. Refer to the delivered code.

  2. Create an extraction program for incremental load that calls the Oracle Healthcare Master Person Index (OHMPI) APIs. Refer to the delivered code.

  3. Add this new data source to the OHMPI project's update.xml file. For more information, refer to the OHMPI documentation.

Note:

If you plan to add another instance of Oracle Clinical, make sure that you modify the datasource_num_id value in the OCDA_W_RXI_LOV_S_seed.sql script to the value specified in the W_RXI_DATASOURCE_S table for this new source system. Connect to the rxi schema and execute the OCDA_W_RXI_LOV_S_seed.sql script.
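
For example, from the command line (a sketch; the password placeholder, TNS alias, and script location are illustrative):

    # Run the seed script as the rxi schema owner, after editing the
    # datasource_num_id value inside the script.
    sqlplus rxi/<password>@<tns_alias> @OCDA_W_RXI_LOV_S_seed.sql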

2.1.2 Oracle Health Sciences Clinical Development Analytics Hierarchy

This section describes the hierarchy that organizes the ETL mappings in DAC. Figure 2-4 displays the OHSCDA hierarchy:

Figure 2-4 OHSCDA Hierarchy


Following is the OHSCDA hierarchy:

  • CONTAINER (CDA_Warehouse) - A single container that holds all objects used for OHSCDA. For deduplication, however, there is a separate container for each deduplicated dimension, holding all the objects involved in its deduplication.

  • EXECUTION PLAN - A data transformation plan defined on subject areas that need to be transformed at certain frequencies. An execution plan is defined based on business requirements for when the data warehouse needs to be loaded. OHSCDA provides a single execution plan to load the complete warehouse.

  • SUBJECT AREAS - A logical grouping of tables related to a particular subject or application context. It also includes the tasks that are associated with the tables, as well as the tasks required to load the tables. Subject areas are assigned to execution plans, which can be scheduled for full or incremental loads.

  • TASK GROUPS - A group of tasks that must be run in a given order.

  • TASKS - A unit of work for loading one or more tables. A task comprises the following: source and target tables, phase, execution type, truncate properties, and commands for full or incremental loads. Each task is a single Informatica workflow.

2.2 Executing the ETL Execution Plans

To load data from the source to their target tables in the data warehouse, run the Execution Plan packaged with OHSCDA. Perform the following tasks in DAC:

  1. Navigate to the Execute view.

  2. Select the CDA - Complete Warehouse execution plan.

  3. Set the parameter values under the Parameter tab.

  4. Build the execution plan.

  5. Click Run.

    Note:

    The prune_days parameter is used to determine the extraction end date for the incremental load. By default, the value of this parameter is 1. This indicates that the source extraction end date is a day earlier than the ETL program run date. For example, if the ETL program run date is 28 July, 2010, the source extraction end date is 27 July, 2010.

  6. If Oracle Clinical is your only data source:

    1. Navigate to the Execution Plan tab.

    2. Click CDA - Oracle Clinical Warehouse De Dup Execution Plan.

    3. Set the required parameters.

    4. Build the Execution Plan.

    5. Click Run.

  7. If Siebel Clinical is your only data source:

    1. Navigate to the Execution Plan tab.

    2. Click CDA - Siebel Clinical Warehouse De Dup Execution Plan.

    3. Set the required parameters.

    4. Build the Execution Plan.

    5. Click Run.

Note:

Execution of the ETL (specifically the OCDA_ETL_RUN_S_POP program) populates W_ETL_RUN_S.LOAD_DT with the timestamp for the execution of the ETL. This ETL execution timestamp is used in the calculation of OHSCDA measures concerning the amount of time that currently open discrepancies have been open.

While the timestamp is captured in CURRENT_ETL_LOAD_DT, it is only available for the calculation of discrepancy intervals through the OBIEE Dynamic Repository Variable CURRENT_DAY. CURRENT_DAY is refreshed from LOAD_DT at a fixed interval (by default, 5 minutes), starting each time the Oracle BI Service is started. Between the time that the ETL is run and the time that CURRENT_DAY is refreshed, calculations of the intervals for which currently open discrepancies have been open will be inaccurate.

There are two remedies: (i) restart the Oracle BI Service after the ETL completes, which restarts the cycle of CURRENT_DAY refreshes; or (ii) wait until the refresh interval has elapsed after the ETL completes before relying on these interval calculations.
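
To check when the warehouse was last loaded, you can query the load timestamp directly, as in the following minimal sketch (assuming read access to the warehouse schema):

    -- Timestamp of the most recent ETL execution, as populated by
    -- OCDA_ETL_RUN_S_POP.
    SELECT MAX(load_dt) AS last_etl_load
    FROM   w_etl_run_s;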

Tip:

You can schedule the jobs to execute at regular intervals. For more information on scheduling jobs, see Section 2.6, "Scheduling an ETL Execution Plan".

If deduplication is required, first run the Execution Plan packaged with OHSCDA to load data from the source to the target tables in the data warehouse. Then perform the following tasks in DAC:

  1. Navigate to the Execute view and select CDA - Complete Initial De Dup Execution Plan.

  2. Set the parameter values under the Parameter tab.

  3. Build the execution plan.

  4. Click Run.

The above execution plan generates the flat files that act as input to the OHMPI projects. Follow the cleanser and loader steps as documented in the OHMPI documentation. Once the bulk loading is done and data stewardship is completed, perform the following steps:

  1. Navigate to the Execute view and select CDA - Complete Warehouse De Dup execution plan.

  2. Set the parameter values under the Parameter tab.

  3. Build the execution plan.

  4. Click Run.

  5. If Oracle Clinical is your only data source:

    1. Navigate to the Execution Plan tab.

    2. Click CDA - Oracle Clinical Warehouse execution plan.

    3. Set the required parameters.

    4. Build the Execution Plan.

    5. Click Run.

  6. If Siebel Clinical is your only data source:

    1. Navigate to the Execution Plan tab.

    2. Click CDA - Siebel Clinical Warehouse Execution Plan.

    3. Set the required parameters.

    4. Build the Execution Plan.

    5. Click Run.

2.3 Customizing an ETL Execution Plan

When you customize an ETL Execution Plan, it is your responsibility to maintain version control over changes to ETL mappings.

Oracle recommends that you carefully track the changes you make to Oracle-supplied ETL so that you can re-apply these changes in subsequent releases.

2.4 Creating an ETL Execution Plan

Though OHSCDA includes ETL Execution Plans for extracting data from Oracle Clinical and Siebel Clinical to the OHSCDA data warehouse, you may want to create your own ETL to extract data from other data sources.

Note:

The value of DATASOURCE_NUM_ID is set to 1 for Oracle Clinical and 2 for Siebel Clinical. If you want to add your own data sources, set this value to a number greater than 100.


If deduplication is required for a new data source, create new extraction programs for both full and incremental loads. If a new fact ETL program that includes a rekeying mapping for deduplication is added to the execution plan, make sure the rekeying mapping is triggered only during incremental loads. To add one or more tables or columns, along with the associated ETL Execution Plans to populate data into these tables, perform the following tasks:

  1. Create the new source and target table metadata inside Informatica.

  2. Work in Informatica PowerCenter to create the ETL components (mappings and transformations) for this program.

  3. Create the required workflow for this mapping.

  4. Connect to DAC and create a new task for this new mapping.

  5. Synchronize the task.

  6. Add the task to the subject area.

  7. Build the Execution Plan (CDA - Complete Warehouse De Dup).

2.5 Modifying an ETL Execution Plan

You may also want to modify an existing ETL to meet your reporting requirements.


To modify an ETL without any changes to the associated tables or columns, perform the following tasks:

  1. In the Informatica repository, identify the ETL components (mappings or workflows) that need to be modified.

  2. Open and modify the ETLs (transformations and/or workflows).

  3. Test and save the changes in the repository.

  4. Connect to DAC and navigate to the corresponding task.

  5. Right-click the task and synchronize it.

  6. Navigate to the execution plan and execute the ETL to verify the changes.

Note:

The ETL Execution Plans that extract data for the warehouse fact tables assume that the dimensions to which each fact is related are up-to-date at the time the fact ETL Execution Plans are executed. This assumption is the basis for certain fact calculations that would provide erroneous results if the assumption were not true. For example, in the received CRFs fact, the value of the pCRF entry complete measure depends on whether or not the study requires second pass entry. But that piece of information -- second pass entry required -- is obtained from an attribute of the Study dimension. So, if the second-pass requirement for a study changes, and the change is not applied to the Study dimension, the Received CRF fact attributes will contain incorrect values.

As shipped, OHSCDA ETL workflows ensure this interlock by executing the ETL for related dimensions immediately before running the ETL for a fact. This is standard warehouse management practice, but especially important given the interdependence of the dimensions and the fact. The need to execute dimension ETL immediately before corresponding fact ETL, and the danger of not doing it, is emphasized here because it is possible (though discouraged) to modify these shipped workflows.

To modify one or more tables or columns without any changes to the associated ETL programs (typically to widen a column; see the sketch after these steps):

  1. Change the table properties as needed.

  2. Save the mapping and refresh the workflow.

  3. Connect to DAC, navigate to the corresponding task, and refresh it.
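
For example, widening a column in the warehouse schema might look like the following; the table and column names are hypothetical.

    -- Hypothetical example: widen a VARCHAR2 column. Widening is compatible
    -- with the existing data; shortening may not be (see the Note below).
    ALTER TABLE w_study_d MODIFY (study_name VARCHAR2(400));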

Note:

If the changes to the tables or columns are not compatible with the table that is installed in the data warehouse schema, you will get a warning while making the change. For example, if you are reducing the length of a number column from 15 to 10, the change is not compatible with the existing data in the table.

If you are customizing the DAC execution plan to include additional dimension ETL programs that are part of deduplication, make sure that all the deduplication-related ETL programs are completed before you trigger the Source-Independent Load ETL programs for the facts.

2.6 Scheduling an ETL Execution Plan

When you submit an Execution Plan for execution in DAC, you can schedule it to execute at regular intervals. To schedule an Execution Plan, perform the following tasks:

  1. Navigate to the Scheduler tab within the Execute view.

  2. Create a new schedule.

  3. Enter the required details and click Save.