Oracle® Health Sciences Clinical Development Analytics Administrator's Guide for Oracle Data Integrator
Release 3.2

E74828-01

2 Extract Transform Load Programs

This chapter contains the following topics:

  • Section 2.1, "ETL Architecture"

  • Section 2.2, "Executing the ETL Load Plans"

  • Section 2.3, "Customizing an ETL Load Plan"

  • Section 2.4, "Creating an ETL Load Plan"

  • Section 2.5, "Modifying an ETL Load Plan"

  • Section 2.6, "Scheduling an ETL Load Plan"

To load data from the source systems to the data warehouse, OHSCDA uses Extract, Transform, and Load (ETL) programs that copy data from the source databases, transform it, and load it into the warehouse tables.

In OHSCDA, Oracle Clinical, Siebel Clinical, and InForm are the source systems for which Oracle provides predefined ETL.

2.1 ETL Architecture

ETL is the process by which data is copied from a source database and placed in warehouse tables, available for analytic queries.

OHSCDA supports the extraction of data from one or more transactional application source databases. In OHSCDA, ETL is performed by execution of ODI Mappings. A mapping is a program that selects data from one or more tables, performs transformations and derivations on the data, and inserts or updates the results in a target table. Mappings are stored in the ODI Repository. Oracle uses ODI to schedule and supervise the execution of the entire set of ETL necessary to populate the OHSCDA warehouse.

Because data can be extracted from databases managed by different applications, the ETL mappings that load any given warehouse table are composed of two parts. The first part is a mapping that reads from a specific application's database and writes to a common warehouse staging table. This mapping is referred to as the SDE, for Source-Dependent Extract. Records in the staging table have a common format, regardless of the source from which they were extracted. For a given warehouse table, there must be one SDE for each supported source application. The second mapping for the given warehouse table reads from its staging table and writes to the final warehouse table. This mapping is called the SIL, for Source-Independent Load.

Figure 2-2 shows the ETL process for extracting data from two tables (one dimension D1 and one fact F1) in one source application database.

Figure 2-2 ETL for One Source Database


Each numbered arrow represents the execution of a Mapping. The Mappings are executed in the order indicated by the numbers. Mapping 4, the SIL that loads fact F1 into the warehouse target table for the fact, has two components. The first part (4a) represents the copy, transformation, and derivation of the staged data for the fact. The second part (4b) represents the setting of the values of each record's foreign key to dimension D1. The result is that dimension D1 can be queried as one of the dimensions in the star-schema centered on fact F1.
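As an illustration of step 4b, the foreign-key resolution performed by the SIL can be sketched as follows. This is illustrative Python with hypothetical column names (D1_INTEGRATION_ID, D1_WID); in OHSCDA the lookup is performed inside the ODI SIL mapping itself, not by user code:

```python
def resolve_dimension_keys(staged_facts, dim_rows):
    """Sketch of step 4b: replace each staged fact's natural key for
    dimension D1 with the dimension's warehouse surrogate key (ROW_WID).

    Column names are hypothetical; ODI performs this lookup within
    the SIL mapping for the fact.
    """
    # Map each dimension record's natural key to its surrogate key.
    key_map = {d["INTEGRATION_ID"]: d["ROW_WID"] for d in dim_rows}
    for fact in staged_facts:
        # Unresolved keys fall back to 0 (a conventional "unknown" row).
        fact["D1_WID"] = key_map.get(fact["D1_INTEGRATION_ID"], 0)
    return staged_facts
```

Once the surrogate keys are set, dimension D1 can be joined to fact F1 in star-schema queries.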

OHSCDA provides a SIL for each of its warehouse tables. It provides SDEs to stage data for these tables from the InForm, Oracle Clinical, and Siebel Clinical databases. The execution of the mappings is under the control of ODI.

The OHSCDA ETL architecture is flexible. It allows you to integrate data from multiple database instances of one application (for example, multiple Oracle Clinical databases), from databases of different source applications (for example, one Oracle Clinical and one Siebel Clinical), or from any combination of these.

The OHSCDA ETL architecture is extensible. OHSCDA provides SDE mappings for InForm, Oracle Clinical, and Siebel Clinical. You can add data from another application to the OHSCDA warehouse by creating the SDE to map its database tables to the OHSCDA staging tables.

For each source application there is one SDE mapping for each warehouse table. This SDE extracts data from the source system and loads it to the staging tables. SDEs have the following feature:

  • Incremental submission mode: OHSCDA-supplied ETL uses timestamps and journal tables in the source transactional system to optimize periodic loads.

There is one SIL mapping for each warehouse target table. The SIL extracts the normalized data from the staging table and inserts it into the data warehouse star-schema target table. SILs have the following attributes:

  • Concerning changes to dimension values over time, OHSCDA overwrites old values with new ones. This strategy is termed the Type 1 Slowly Changing Dimension approach.

  • OHSCDA's data model includes aggregate tables and a number of indexes, designed to minimize query time.

  • The results of each ETL execution are logged by ODI. The logs hold information about errors encountered during execution.

    Every table in the OHSCDA warehouse that is populated through ODI has a corresponding error log table.

    The naming convention for the error log table is as follows:

            E$_<TABLE_NAME>
    

    If the length of the table name is greater than 26, a different naming convention is used for the error log table: the characters preceding the second instance of underscore (_) in the table name are replaced with E$. The naming convention in this case is as follows:

    E$_<remaining characters of table name>
    

    For example, if the table name is W_HS_MDM_APPLICATION_USER_DS, the characters W_HS are replaced with E$. Hence, the name of the error log table will be E$_MDM_APPLICATION_USER_DS.

    During ETL execution, records that fail to be inserted into the target table (for example, because they violate a constraint) are placed in the corresponding error tables. You can review which records did not make it into the data warehouse and decide on the appropriate action for them.
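The naming rule described above can be sketched as a small function (illustrative Python; a hypothetical helper, not part of OHSCDA — ODI assigns the actual error table names):

```python
def error_table_name(table_name):
    """Derive the E$_ error log table name for a warehouse table,
    following the convention described above."""
    if len(table_name) <= 26:
        return "E$_" + table_name
    # For longer names, the characters preceding the second underscore
    # are replaced with E$ (assumes the name contains two underscores).
    second_underscore = table_name.index("_", table_name.index("_") + 1)
    return "E$" + table_name[second_underscore:]

# The 28-character name W_HS_MDM_APPLICATION_USER_DS becomes
# E$_MDM_APPLICATION_USER_DS, as in the example above.
```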

2.1.1 Adding Data Source Information

As you read data from different database instances, you must identify the source of each record. OHSCDA provides the OCDA_CONTROL_S, OCDA_CONTROL_LOG_S, and W_RXI_DATASOURCE_S tables (in the source-specific SRO schema), which store information about the data sources from which data is extracted for OHSCDA. These tables are populated by a procedure, a trigger, or seed data.

  • OCDA_CONTROL_S: This table is present in the OHSCDA SRO schema. It has only one row per Enterprise ID present in the OHSCDA application. Initial data is populated during the full load; subsequent incremental ETL runs update this table for the specific enterprises for which they run, thereby tracking the last extract date. The following are some of the columns in this table:

    • DATASOURCE_NUM_ID: It is used to identify the data source.

    • SOURCE_EXTRACTION_START_DT: It stores the ETL start date.

    • SOURCE_EXTRACTION_END_DT: It stores the ETL end date.

    • PROGRAM_NAME: It stores the load plan name.

    • PROGRAM_RUN_DT: It stores the date on which the last ETL was successfully executed. This value is populated when the ETL is executed successfully; otherwise, it is NULL.

    • CONFIG_DAYS: It stores the value for the PRUNE days. By default, its value is 1. You can modify the value of this column. The end date for an incremental load is derived from this column as SYSDATE - CONFIG_DAYS.

    • ETL_RUN_ID: It is generated using a sequence.

  • OCDA_CONTROL_LOG_S: Before every ETL run, the information in the OCDA_CONTROL_S table is copied to OCDA_CONTROL_LOG_S by a trigger, for auditing.

  • W_RXI_DATASOURCE_S: It is used for the InForm source system to get the DATASOURCE_NUM_ID and DATASOURCE_NAME. The following are some of the columns in this table:

    • ROW_WID: A unique ID for each record in the table.

    • DATASOURCE_NUM_ID: The ID for the database.

    • DATASOURCE_NAME: A meaningful name of the database.

    • DATASOURCE_TYPE: Application system that manages the database.

    • DESC_TEXT: Optional text describing the purpose of the database.

    • INTEGRATION_ID: Set this to the same value as DATASOURCE_NUM_ID.

    • INFORM_STUDY_GUID: Study GUID of the InForm source.
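Taken together, the OCDA_CONTROL_S columns drive the incremental extraction window. As a rough sketch (illustrative Python; the actual window is computed by the ODI load plan from OCDA_CONTROL_S and the CONFIG_DAYS PRUNE setting):

```python
from datetime import date, timedelta

def extraction_window(last_extraction_end, config_days=1, run_date=None):
    """Sketch of the incremental extraction window.

    The window starts where the last successful extract ended
    (SOURCE_EXTRACTION_END_DT) and ends CONFIG_DAYS (PRUNE days)
    before the run date, that is, SYSDATE - CONFIG_DAYS.
    """
    run_date = run_date or date.today()
    return last_extraction_end, run_date - timedelta(days=config_days)

# With the default CONFIG_DAYS of 1, a run on 28 July 2010
# extracts data through 27 July 2010.
start, end = extraction_window(date(2010, 7, 1), config_days=1,
                               run_date=date(2010, 7, 28))
```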

See Also:

  • Oracle Health Sciences Clinical Development Analytics Electronic Technical Reference Manual, for more information about these tables.

  • Section 2.1.3, "Adding a New Data Source" for more information about how to add a new data source to OHSCDA.

2.1.2 Handling Deletions in Siebel Clinical

OHSCDA provides an optional feature to manage hard deletion of records in Siebel Clinical. You create triggers in the source system to handle deletion of records. To do this:

  1. Navigate to the temporary staging location where the OHSCDA installer copies the installation files.

  2. Connect to the Siebel Clinical data source and run the ocda_sc_del_trigger.sql script delivered with OHSCDA. This script creates the triggers on tables provided as input. The following are the tables in Siebel Clinical for which OHSCDA supports creating triggers:

    • S_CL_PTCL_LS

    • S_PROD_INT

    • S_CL_SUBJ_LS

    • S_CONTACT

    • S_CL_PGM_LS

    • S_PTCL_SITE_LS

    • S_EVT_ACT

    • S_ORG_EXT

    Provide a comma-separated list of the table names for which the triggers need to be created as the script's input, for example, S_CL_PTCL_LS, S_PROD_INT, and S_CL_SUBJ_LS. The table names that you provide can only be a subset of the tables listed above.

    Note:

    When you delete a record in the table, the primary key of the deleted record is inserted in the RXI_DELETE_LOG_S table on the Siebel source system.

  3. Set the values of the OHSCDA.DELETE_FLOW and DELETE_FLOW variables to Y while executing the load plan in ODI.

  4. Execute the ETLs as listed in the Section 2.2, "Executing the ETL Load Plans".

    The Siebel Clinical SDE mappings read this instance of the RXI_DELETE_LOG_S table.

Note:

Records that are deleted in the source system are soft deleted in the data warehouse.

2.1.3 Adding a New Data Source

OHSCDA provides predefined source-dependent extract (SDE) mappings for InForm, Oracle Clinical, and Siebel Clinical. It also provides the OCDA_CONTROL_S, OCDA_CONTROL_LOG_S, and W_RXI_DATASOURCE_S tables, and the procedure and trigger that populate these tables, for each SRO schema.

For an InForm source, enter your source system's information in the W_RXI_DATASOURCE_S table, whose structure is described in Section 2.1.1, "Adding Data Source Information". For other source systems, update the procedure that populates OCDA_CONTROL_S with a unique DATASOURCE_NUM_ID (set this value to a number greater than 100).

Perform the following tasks to add a new data source:

  1. Install OHSCDA in the new source system (InForm, Oracle Clinical, or Siebel Clinical). For information, see the Oracle Health Sciences Clinical Development Analytics Installation and Configuration Guide.

    This creates the SRO schema, work schema, and related objects for the selected source system.

  2. In ODI Topology, create a new physical connection pointing to the new source system.

  3. Edit the context and change Physical Schemas to point to the new physical connection.

    For example, if the new source system is Oracle Clinical, edit the context CTX_OCDA_OC and change LS_OCDA_OracleClinical to point to the new physical schema.

    Important:

    • When calling an SDE mapping to read from a particular database instance, ensure that its corresponding SRO view has the value of DATASOURCE_NUM_ID that corresponds to that database and the appropriate ENTERPRISE_ID (normally 0).

    • If you write a new SDE, ensure that it sets the value of DATASOURCE_NUM_ID in the staging table to which it writes.

  4. Make a copy of OCDA_W_RXI_LOV_S_seed.sql.

  5. In the copy, modify the value of datasource_num_id to the value that you specified in the procedure that populates the OCDA_CONTROL_S table for the source database that you are adding.

  6. Connect to the SRO account of the source.

  7. Truncate W_RXI_LOV_S.

  8. Run the modified copy of OCDA_W_RXI_LOV_S_seed.sql.

2.2 Executing the ETL Load Plans

OHSCDA ships the following load plans:

  • CDA - Complete Warehouse with Dedup: This is used to load the warehouse from one or all of the three source systems (Oracle Clinical, Siebel Clinical, and InForm).

  • CDA - Inform Automerge: This is used to load the warehouse only from the InForm data source, and only when the automerge feature is needed. For information on how to execute this load plan, see Section 2.2.2, "Executing CDA - Inform Automerge Load Plan".

Note:

Before executing the ETL load plans, you must create a validation procedure to populate the Patient Statuses table for each study using two packaged procedures provided with Oracle Clinical. For information on how to create a validation procedure, see Oracle Clinical Creating a Study.

2.2.1 Executing CDA - Complete Warehouse with Dedup Load Plan

To load data from the source to their target tables in the data warehouse, run the CDA - Complete Warehouse with Dedup load plan packaged with OHSCDA. Perform the following tasks in ODI:

  1. Click the Operator tab.

  2. Select Load Plans and Scenarios.

  3. Right-click the CDA - Complete Warehouse with Dedup load plan and select Run.

    The Start Load Plan dialog box appears.

    Note:

    • The VAR_INT_CONFIG_DAYS variable is used to determine the extraction end date for the incremental load. For a full load, set the value of this variable to 1. For an incremental load, set the value of this variable to 0. A value of 1 means that the source extraction end date is a day less than the ETL program run date. For example, if the ETL program run date is 28 July, 2010, the source extraction end date is 27 July, 2010.

    • The VAR_OCDA_DWH_LOAD_FLAG variable is used to determine whether the load is full or incremental. For an incremental load, set the value of this variable to 0. For a full load, set the value of this variable to 1.

    • Setting the VAR_OCDA_DWH_LOAD_FLAG value to 1 while executing the load plan truncates the complete warehouse, drops the indexes on the DWH tables, loads the data, and recreates the indexes.

  4. Select CTX_OCDA from the Context drop-down list.

  5. Select OCDA_LA from the Logical Agent drop-down list.

  6. Select <Use Session Task Log Level> from the Log Level drop-down list.

  7. Assign startup values for the Load Plan Variables.

    • OHSCDA.AUTO_MERGE_FLG: Set value as N.

    • OHSCDA.DELETE_FLOW: Set value as Y to enable the Siebel Clinical deletion flow.

    • OHSCDA_NLS_FLAG: Set value as Y to enable multi-byte character support.

    • OHSCDA.VAR_IF: If InForm is one of your data sources, or your only data source, set the value as Y to run the load plan for the InForm source.

    • OHSCDA.VAR_INT_CONFIG_DAYS: Set a numeric value (this variable is used to determine the extraction end date for the incremental load).

    • OHSCDA.VAR_OC: If Oracle Clinical is one of your data sources, or your only data source, set the value as Y to run the load plan for the Oracle Clinical source.

    • OHSCDA.VAR_OCDA_DWH_LOAD_FLAG: For a full load, set the value as 1; for an incremental load, set the value as 0.

    • OHSCDA.VAR_OCDA_ERROR_REJECT_LIMIT: This variable sets the number of rows that are tracked in the respective error tables before the ETL is aborted in case of errors.

      You must enter a non-negative integer for this variable, for example, 0, 10, 100, or 1000.

    • OHSCDA.VAR_SC: If Siebel Clinical is one of your data sources, or your only data source, set the value as Y to run the load plan for the Siebel Clinical source.

    • OHSCDA.VAR_De_Dup: Set value as N.

    • OHSCDA.VAR_LOGFILE: Currently, not in use.

    • OHSCDA.VAR_AUTHFILE: Currently, not in use.

    • OHSCDA.NLS_FLAG: Set this flag as Y if you want to enable the multi-byte support.

  8. Click OK.
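For reference, the startup values from step 7 for a full load from all three sources can be summarized as a simple mapping (a hypothetical Python dict, shown only to summarize the variables; in practice the values are entered in the ODI Start Load Plan dialog):

```python
# Hypothetical summary of the step 7 startup values for a full load
# from all three sources; these are entered in ODI, not set from code.
full_load_startup = {
    "OHSCDA.AUTO_MERGE_FLG": "N",
    "OHSCDA.DELETE_FLOW": "Y",           # enable the Siebel Clinical deletion flow
    "OHSCDA_NLS_FLAG": "Y",              # multi-byte character support
    "OHSCDA.VAR_IF": "Y",                # InForm is a source
    "OHSCDA.VAR_OC": "Y",                # Oracle Clinical is a source
    "OHSCDA.VAR_SC": "Y",                # Siebel Clinical is a source
    "OHSCDA.VAR_INT_CONFIG_DAYS": 1,
    "OHSCDA.VAR_OCDA_DWH_LOAD_FLAG": 1,  # 1 = full load, 0 = incremental
    "OHSCDA.VAR_OCDA_ERROR_REJECT_LIMIT": 1000,
    "OHSCDA.VAR_De_Dup": "N",
}
```

For an incremental load, the same variables apply with OHSCDA.VAR_OCDA_DWH_LOAD_FLAG set to 0.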

Note:

Execution of the ETL (specifically the OCDA_ETL_RUN_S_POP program) populates W_ETL_RUN_S.LOAD_DT with the timestamp for the execution of the ETL. This ETL execution timestamp is used in the calculation of OHSCDA measures concerning the amount of time that currently open discrepancies have been open.

While the timestamp is captured in CURRENT_ETL_LOAD_DT, it is only available for calculation of discrepancy intervals through the OBIEE Dynamic Repository Variable CURRENT_DAY. CURRENT_DAY is refreshed from LOAD_DT at a fixed interval, by default 5 minutes, starting each time the Oracle BI Service is started. Between the time that the ETL is run, and the time that CURRENT_DAY is refreshed, calculations of intervals that currently open discrepancies have been open will be inaccurate.

A remedy is as follows:

Tip:

You can schedule the jobs to execute at regular intervals. For more information on scheduling jobs, see Section 2.6, "Scheduling an ETL Load Plan".

2.2.2 Executing CDA - Inform Automerge Load Plan

To load data from the source to their target tables in the data warehouse, run the CDA - Inform Automerge load plan packaged with OHSCDA. Perform the following tasks in ODI:

  1. Click the Operator tab.

  2. Select Load Plans and Scenarios.

  3. Right-click the CDA - Inform Automerge load plan and select Run.

    The Start Load Plan dialog box appears.

  4. Select CTX_OCDA from the Context drop-down list.

  5. Select OCDA_LA from the Logical Agent drop-down list.

  6. Select <Use Session Task Log Level> from the Log Level drop-down list.

  7. Assign startup values for the Load Plan Variables.

    • OHSCDA.AUTO_MERGE_FLG: Set value as Y.

    • OHSCDA.VAR_INT_CONFIG_DAYS: Set numeric value (this variable is used to determine the extraction end date for the incremental load).

    • OHSCDA.VAR_OCDA_DWH_LOAD_FLAG: For full load, set value as 1 and for incremental load set value as 0.

    • OHSCDA.VAR_OCDA_ERROR_REJECT_LIMIT: This variable sets the number of rows that are tracked in the respective error tables before the ETL is aborted in case of errors.

      You must enter a non-negative integer for this variable, for example, 0, 10, 100, or 1000.

  8. Click OK.

2.3 Customizing an ETL Load Plan

When you customize an ETL load plan, it is your responsibility to maintain version control over changes to ETL mappings.

Oracle recommends that you carefully track the changes you make to Oracle-supplied ETL so that you can re-apply these changes in subsequent releases.

2.4 Creating an ETL Load Plan

Though OHSCDA includes ETL load plans for extracting data from InForm, Oracle Clinical, and Siebel Clinical to the OHSCDA data warehouse, you may want to create your own ETL to extract data from other data sources.

Note:

The value of DATASOURCE_NUM_ID is set to 1 for Oracle Clinical, 2 for Siebel Clinical, and 3 for InForm. If you want to add your own data sources, set this value to a number greater than 100.
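A trivial guard for this numbering convention might look like the following (illustrative Python; the function name is hypothetical and not part of OHSCDA):

```python
# IDs reserved for the shipped source systems, per the note above.
RESERVED_IDS = {1: "Oracle Clinical", 2: "Siebel Clinical", 3: "InForm"}

def validate_custom_datasource_num_id(num_id):
    """Reject DATASOURCE_NUM_ID values reserved for shipped sources;
    custom data sources must use a value greater than 100."""
    if num_id in RESERVED_IDS:
        raise ValueError(f"{num_id} is reserved for {RESERVED_IDS[num_id]}")
    if num_id <= 100:
        raise ValueError("custom DATASOURCE_NUM_ID must be greater than 100")
    return num_id
```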

See Also:

  • Oracle Data Integrator Online Help

2.5 Modifying an ETL Load Plan

You may also want to modify an existing ETL load plan to meet your reporting requirements.

See Also:

  • Oracle Data Integrator Online Help

To modify an ETL load plan without any changes to the associated tables or columns, perform the following tasks:

  1. Identify the load plan that needs to be modified in ODI.

  2. Open and modify the ETLs.

  3. Test and save the changes in repository.

  4. Regenerate the scenario with the same scenario version.

  5. Navigate to the load plan.

  6. Regenerate and execute ETL to verify the changes.

The ETL load plans that extract data for the warehouse fact tables assume that the dimensions to which each fact is related are up-to-date at the time the fact ETL load plans are executed. This assumption is the basis for certain fact calculations that would provide erroneous results if the assumption were not true. For example, in the Received CRFs fact, the value of the pCRF entry complete measure depends on whether or not the study requires second-pass entry. But that piece of information, whether second-pass entry is required, is obtained from an attribute of the Study dimension. So, if the second-pass requirement for a study changes and the change is not applied to the Study dimension, the Received CRFs fact attributes will contain incorrect values.

As shipped, the OHSCDA ETL load plans ensure this interlock by executing the ETL for related dimensions immediately before running the ETL for a fact. This is standard warehouse management practice, but it is especially important given the interdependence of the dimensions and the facts. The need to execute dimension ETL immediately before the corresponding fact ETL, and the danger of not doing so, is emphasized here because it is possible (though discouraged) to modify these shipped load plans.

To modify one or more tables or columns without any changes to the associated ETL programs (typically to widen a column):

  1. Change the table properties as needed.

  2. Refresh and validate the mapping and regenerate the scenario of the existing version.

  3. Navigate to the load plan and verify the scenario to which it refers.

Note:

Ensure that changes to the tables or columns are compatible with the table that is installed in the data warehouse schema. For example, if you are reducing the length of a number column from 15 to 10, ensure that the change is compatible with the existing data in the table.

2.6 Scheduling an ETL Load Plan

When you submit a load plan for execution in ODI, you can schedule it to execute at regular intervals. To schedule a load plan, perform the following steps:

  1. Select Load Plans and Scenarios.

  2. Expand the CDA - Complete Warehouse with Dedup load plan.

  3. Right-click Scheduling and select New Scheduling.

  4. Select Definition.

  5. Select the required values for the following:

    • Context

    • Logical Agent

    • Log Level

    • Status

    • Execution

  6. Click Save.