1 ETL Overview

The Oracle Business Intelligence Applications (Oracle BI Applications) Extract, Transform, and Load (ETL) architecture and supporting technologies provide the ability to load data into your data warehouse.

Before Running ETL

Before you begin running Oracle BI Applications ETL processes, you must have completed the installation and setup of Oracle BI Applications.

To install Oracle BI Applications, see Installing and Setting Up Oracle BI Applications in Oracle Business Intelligence Applications Installation Guide.

You must also have run the domains load plan, which loads source-specific data into Oracle BI Applications Configuration Manager tables. This enables Configuration Manager to display the appropriate source-specific values as choices in drop-down lists for setup objects. To run the domains load plan, see Running the Domains Load Plan in Oracle Business Intelligence Applications Installation Guide.

About ETL Architecture

Typically, the extract-load-transform process has two main steps: the extract and stage load step, followed by the load transform step.

The extract and stage load step is generated from a combination of the main interface and the temporary interface. The load transform step is generated by the integration knowledge module (IKM). In this on-premises example, the extract and stage load step issues a SQL statement on the source that joins the GL_SET_OF_BOOKS table with the GL_PERIOD_TYPES table. The join is executed on the source database and the resulting data is staged. A second join then occurs during the load transform step, between the W_DOMAIN_G table and the temporary stage table, which loads the stage table W_LEDGER_DS.
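
The following sketch suggests what these two generated steps might look like as SQL. The column lists, the join conditions, and the C$ work-table name are illustrative assumptions, not the statements ODI actually generates.

    -- Step 1: extract and stage load, executed on the source database.
    -- All column names and the work-table name below are illustrative.
    INSERT INTO C$_W_LEDGER_DS (LEDGER_ID, LEDGER_NAME, PERIOD_TYPE)
    SELECT sob.SET_OF_BOOKS_ID,
           sob.NAME,
           pt.PERIOD_TYPE
    FROM   GL_SET_OF_BOOKS sob
    JOIN   GL_PERIOD_TYPES pt
      ON   pt.PERIOD_TYPE = sob.ACCOUNTED_PERIOD_TYPE;

    -- Step 2: load transform, executed on the warehouse by the IKM.
    -- Joins the staged rows to W_DOMAIN_G and loads the stage table.
    INSERT INTO W_LEDGER_DS (LEDGER_ID, LEDGER_NAME, PERIOD_TYPE_CODE)
    SELECT stg.LEDGER_ID,
           stg.LEDGER_NAME,
           dom.DOMAIN_MEMBER_CODE
    FROM   C$_W_LEDGER_DS stg
    LEFT OUTER JOIN W_DOMAIN_G dom
      ON   dom.DOMAIN_MEMBER_CODE = stg.PERIOD_TYPE;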

Note:

Oracle Database is the only database type supported for the Oracle BI Applications repository schemas and the Business Analytics Warehouse.

There are four main stages. The first, unique to cloud sources, is the Source Dependent Data Store (SDS) stage, which loads and incrementally maintains replicated data in an SDS schema from cloud sources such as Fusion Cloud and Taleo Cloud. A Health Check stage generates a diagnostic report identifying problematic source data that might cause ETL failure, data loss, or corruption in the data warehouse. The Source Dependent Extract (SDE) tasks then extract data, either from SDS schema tables in the case of cloud sources or from source dimension and fact tables in the case of on-premises sources, and load the data into universal dimension and fact staging tables. Finally, the Source Independent Load (SIL) tasks, which are common to all sources, load data from the universal staging tables into the warehouse dimension and fact tables. There is a dependency between the dimension table and the fact table, so the SIL DIM must be executed before the SIL FACT in order to resolve the dimension key: the SIL DIM uses a database sequence to generate the key, and the SIL FACT then looks up that key when loading the fact table.
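
For example, the SIL DIM and SIL FACT dependency might look like the following sketch. The table, sequence, and column names (including ROW_WID and INTEGRATION_ID) are illustrative assumptions rather than the actual generated statements.

    -- SIL DIM: a database sequence generates the surrogate key.
    INSERT INTO W_LEDGER_D (ROW_WID, INTEGRATION_ID, LEDGER_NAME)
    SELECT W_LEDGER_D_SEQ.NEXTVAL,
           stg.INTEGRATION_ID,
           stg.LEDGER_NAME
    FROM   W_LEDGER_DS stg;

    -- SIL FACT: resolve the dimension key by looking it up in the
    -- dimension table loaded above; unresolved keys default to 0 here.
    INSERT INTO W_GL_OTHER_F (LEDGER_WID, INTEGRATION_ID, AMOUNT)
    SELECT NVL(dim.ROW_WID, 0),
           stg.INTEGRATION_ID,
           stg.AMOUNT
    FROM   W_GL_OTHER_FS stg
    LEFT OUTER JOIN W_LEDGER_D dim
      ON   dim.INTEGRATION_ID = stg.LEDGER_ID;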

About ETL Phases

Oracle BI Applications ETL processes include these phases: SDS, Health Check, SDE, SIL, and PLP.

  • SDS stands for Source Dependent Data Store. In this phase, a separate schema on the data warehouse database is maintained as a replication of the source transactional systems’ tables, together with deletion records and additional optimizations for incremental ETL. Each SDS requires its own separate schema because there can be multiple SDS schemas, each containing the same object names. Typically, you see a corresponding load plan step, SDS Load Phase, in your generated load plan when you enable extraction from cloud sources, for example when extracting data from Fusion Cloud or Taleo Cloud. SDS Load Phase tasks extract data from cloud sources and stage it in SDS tables.

  • Health Check. Health Check is a preliminary ETL phase in which a diagnostic report is generated to identify problematic source data that might cause ETL failure, data loss, or corruption in the data warehouse. The report is downloaded from Configuration Manager and includes any problematic data, a description of detected issues, and actions to resolve health check failures.

  • SDE stands for Source Dependent Extract. In this phase, SDE tasks extract data from the source system, or from the SDS in the case of cloud sources, and stage it in staging tables. The SDE tasks are source-specific.

  • SIL stands for Source Independent Load. Load tasks transform and port the data from staging tables to base fact or dimension tables. The SIL tasks are source-independent.

  • PLP stands for Post Load Process. PLP tasks are executed only after the dimension and fact tables are populated. A typical use of a PLP task is to transform data from a base fact table and load it into an aggregate table, as in the sketch following this list. The PLP tasks are source-independent.
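
As a rough illustration of a PLP task, the following sketch summarizes a base fact table into an aggregate table. The table and column names are assumptions for illustration only.

    -- PLP sketch: aggregate the base fact into a summary table.
    INSERT INTO W_GL_OTHER_A (LEDGER_WID, PERIOD_WID, TOTAL_AMOUNT)
    SELECT LEDGER_WID,
           PERIOD_WID,
           SUM(AMOUNT)
    FROM   W_GL_OTHER_F
    GROUP  BY LEDGER_WID, PERIOD_WID;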

About the Oracle Data Integrator Repository

The Oracle Data Integrator (ODI) Repository for Oracle BI Applications comprises two repositories: master and work.

  • Master Repository. Contains topology of resources, security, and version management. A master repository is usually associated with multiple work repositories.

  • Work Repository. Contains data models and projects. A work repository is the development and execution repository and is attached to a single master repository.

The master and work repositories must reside in the same database schema and the database type must be Oracle. Both the master and work repositories are set up during the Oracle BI Applications installation process.

The default ODI repository ID is 500. This ID is part of the internal ID of every object that is created within the ODI repository. Keeping the repository ID at 500 or greater is critical to ensure that the objects you create don't overlap with any current or future Oracle BI Applications objects. If you change the default ODI repository ID, make sure the new value is greater than 500.

About Load Plans

A load plan is an executable object that comprises and organizes the child objects (referred to as steps) that carry out the ETL process. A load plan is made up of a sequence of several types of steps. Each step can contain several child steps. Depending on the step type, the steps can be executed conditionally, in parallel, or sequentially.

You define a load plan in Configuration Manager by selecting a data source and one or more fact groups. This selection determines which steps are performed during the ETL process. Each fact group belongs to a specific functional area or areas that are associated with one or more offerings, which, in turn, are related to a data server. A transactional data source is associated with one or more data servers.

After you define the load plan, you then generate it to build it in the ODI repository. You then execute the load plan to perform the ETL process.

To work with Oracle BI Applications load plans, see Managing Load Plans. To work with load plans in the context of Oracle Data Integrator, see Introduction to Load Plans in Oracle Fusion Middleware Developer's Guide for Oracle Data Integrator.

About Changed Data Capture

Oracle BI Applications has two ETL modes for loading data into the Oracle Business Analytics Warehouse: full and incremental.

During a full load, Oracle BI Applications extracts:
  • All records from tables that are sources for dimension tables.

  • Records created after an Initial Extract Date from tables that are sources for fact tables. The Initial Extract Date defines a cut-off so that not all records are loaded into the data warehouse. You set the Initial Extract Date value for each data source in Configuration Manager.

An ETL process can extract a record from a single table or from multiple tables. When a record is the result of joining multiple tables, one of these tables is identified as the base table, which defines the granularity of the record. When extracting fact records, Oracle BI Applications only compares the Created Date of the base table to the Initial Extract Date.
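
For instance, a full-load fact extract might filter on the base table's created date as in the following sketch. The table names, columns, and the bind variable standing in for the Initial Extract Date are illustrative assumptions.

    -- Full load: only fact records created in the base table on or after
    -- the Initial Extract Date are extracted; the joined table's created
    -- date is not compared.
    SELECT hdr.INVOICE_ID,
           hdr.CREATION_DATE,
           line.AMOUNT
    FROM   INVOICE_HEADERS hdr   -- base table: defines the granularity
    JOIN   INVOICE_LINES line
      ON   line.INVOICE_ID = hdr.INVOICE_ID
    WHERE  hdr.CREATION_DATE >= :INITIAL_EXTRACT_DATE;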

During an incremental load, Oracle BI Applications extracts records that have changed or were created after a Last Extract Date. It does this by comparing the Last Extract Date value to a Last Updated Date (LUD) type column in the source table. If the source table doesn't have such a column, Oracle BI Applications extracts all records from that table. The Last Extract Date is calculated from the last time data was extracted from the table, less a Prune Days value. The Prune Days parameter extends the window of the ETL extract beyond the last time the ETL actually ran, ensuring that records missed in a prior ETL are picked up in the next one. A record can be missed when it is updated while the ETL process is running but isn't committed until after the ETL completes.
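
Conceptually, the incremental filter behaves like the following sketch, in which the effective extract window is the Last Extract Date moved back by the Prune Days value. The table, columns, and bind variables are illustrative assumptions.

    -- Incremental load: extract rows whose LUD column is on or after the
    -- last extract date minus the prune days window.
    SELECT txn.TRANSACTION_ID,
           txn.AMOUNT,
           txn.LAST_UPDATE_DATE
    FROM   SOURCE_TRANSACTIONS txn
    WHERE  txn.LAST_UPDATE_DATE >= :LAST_EXTRACT_DATE - :PRUNE_DAYS;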

You set the Prune Days parameter value in Configuration Manager before the first full load ETL and the value automatically takes effect in any subsequent incremental ETL. Bear the following in mind:

  • Setting a small value means the ETL extracts fewer records, thus improving performance. However, this setting increases the chances that records are not detected.

  • Setting a large number is useful if ETL runs are infrequent, but it increases the number of records that are extracted and updated in the data warehouse. Therefore, you shouldn't set the Prune Days value to a very large number. You can also use a large Prune Days value to trigger re-extraction of records that were processed but haven't changed.

  • Setting the value to 0 means the ETL doesn't extract any records. Hence, you should never set the Prune Days to 0.

If you haven't set the Prune Days parameter prior to running a full ETL but want to apply one to a current incremental load, you can do so using the scenario generated from the RESET_LAST_EXTRACT_DATE procedure. In ODI Designer, set the scenario's BIAPPS.DATASOURCE_NUM_ID parameter to the source system’s ID and its BIAPPS.PRUNE_DAYS parameter to the desired number of prune days, and then execute the scenario from the ODI Console to set the Prune Days.

An ETL process can extract a record from a single table, but more commonly extracts records that are the result of joining multiple tables. When extracting from multiple tables, one table is identified as the base table, which defines the granularity of the record. A change in the base table triggers an extract of the record. However, a change can also occur in a non-base table without any change in the base table itself. Non-base tables whose changes should trigger an extract of the record are referred to as auxiliary tables. Thus, when determining whether a record should be extracted, Oracle BI Applications compares not only the LUD column in the base table but also the LUD columns in all auxiliary tables. If the LUD column changed in any of these tables, the record is extracted. If changes in a table aren't relevant to triggering an extract, that table's LUD column isn't compared in the incremental filtering logic.
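
The following sketch illustrates this filtering with one base table and two auxiliary tables: a change to the LUD column in any of the three re-extracts the record. All table and column names are illustrative assumptions.

    -- A change in the base table OR either auxiliary table triggers
    -- extraction of the record.
    SELECT ord.ORDER_ID,
           ord.STATUS,
           line.AMOUNT
    FROM   ORDERS ord                                          -- base table
    JOIN   ORDER_LINES line ON line.ORDER_ID = ord.ORDER_ID    -- auxiliary
    JOIN   ORDER_NOTES note ON note.ORDER_ID = ord.ORDER_ID    -- auxiliary
    WHERE  ord.LAST_UPDATE_DATE >= :LAST_EXTRACT_DATE
       OR  line.LAST_UPDATE_DATE >= :LAST_EXTRACT_DATE
       OR  note.LAST_UPDATE_DATE >= :LAST_EXTRACT_DATE;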

About Knowledge Modules

Knowledge Modules (KMs) implement different tasks within the Oracle Business Analytics Warehouse system.

The following are the different types of KMs available:

  • Reverse-engineering (RKM). Used to read table and other object metadata from source databases and to import tables, columns, and indexes into a model. See About Reverse Knowledge Modules.

  • Loading (LKM). Used for efficient extraction of data from source databases for loading into a staging area (database-specific bulk unload utilities can be used where available).

  • Integration (IKM). Used to load data into a target with different strategies, for example, slowly changing dimensions and insert/update strategies.

  • Check (CKM). Used to validate and cleanse data.

  • Journalizing (JKM). Used to record the new and changed data within either a single table or view or a consistent set of tables or views.

  • Service (SKM). Used to expose data in the form of web services.

About Reverse Knowledge Modules

Oracle BI Applications uses the ODI reverse engineering process to populate the repository with metadata from the source system. Reverse Knowledge Modules (RKMs) retrieve metadata from data storage and load it into the repository.

For example, RKMs detect the description of tables, columns, data types, constraints, and comments from a database to load the repository. RKMs support various technologies, such as databases, XML files, and various types of flat files. You can also use RKMs to retrieve non-standard metadata from databases or applications, such as Oracle E-Business Suite, Siebel CRM, and PeopleSoft.

The RKM performs customized reverse engineering for a model. The RKM handles connecting to the application or metadata provider, then transforming and writing the resulting metadata into the ODI repository. The metadata is written into temporary tables, and the RKM then calls the ODI API to read from these tables and write to the ODI metadata tables of the work repository in incremental update mode.

Note that the ODI repository of Oracle BI Applications contains the relevant source data models. Therefore, you run an RKM only if you have customized tables in the source system and want to import these changes to the ODI repository. To customize tables and tasks, see Customizing the Oracle Business Analytics Warehouse.

During the reverse engineering process, the ODI datastore columns and constraints can differ in ODI 12c as compared to 11g. You must take this into account when performing reverse engineering for customizations. Consider these example situations:
  • Example 1: Column A and B are deleted from the database table.

    In ODI 11g, Column A and B are retained in the ODI datastore. In ODI 12c, Column A and B are removed from the ODI datastore because the ODI datastore columns and constraints are equivalent to the source and target database metadata.

  • Example 2: Column C is renamed to Column "NEW" in the database table.

    In ODI 11g, Column C is retained and Column "NEW" is added to the ODI datastore. In ODI 12c, Column C is replaced with Column "NEW" in the ODI datastore.

  • Example 3: Column "PK_COL3" is removed from the Primary Key.

    In ODI 11g, Column "PK_COL3" is retained in the ODI datastore. In ODI 12c, Column "PK_COL3" is removed from the ODI datastore.

About Multi-Source Environments

Oracle BI Applications supports the loading of data into the Oracle Business Analytics Warehouse from multiple source systems. If the same adaptor is required for two different sources, the adaptor's model and maps must be duplicated in the ODI repository.

Consider a scenario in which you have a PeopleSoft 9.2 source and an Oracle EBS 12.2 source, both loading the target fact table ASTXACT_FG. Loading this target table includes three serial steps:

  1. Initialize Target Table (ASTXACT_FG).

  2. Load Target Table (ASTXACT_FG).

  3. Finalize Target Table (ASTXACT_FG).

These three steps appear as serial steps in the generated load plan in ODI Studio.

Data from both sources, PeopleSoft 9.2 and Oracle EBS 12.2, is loaded into the target table (ASTXACT_FG) for the first time, which means a full load occurs; and, generally, when a full load occurs, the target table is truncated. In this case, the loads from the two sources can run in parallel because the data sources are not dependent on each other; however, the load from the second source must not truncate the target table if it already contains data from the first source. The following serial actions are performed in the process (see the sketch after the list):

  1. The Initialize Target Table step truncates the target table.

  2. The Load Target Table step, in turn, has two parallel child steps, each of which loads the target table from one source system.

  3. The Finalize Target Table step creates any indexes and analyzes the table.
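
In SQL terms, the pattern is roughly the following sketch. The statements, staging table names, index, and DATASOURCE_NUM_ID values are illustrative assumptions; the generated load plan produces the equivalent steps.

    -- 1. Initialize: truncate exactly once, before any source loads.
    TRUNCATE TABLE ASTXACT_FG;

    -- 2. Load: these two loads run as parallel child steps.
    INSERT INTO ASTXACT_FG (TXN_ID, AMOUNT, DATASOURCE_NUM_ID)
    SELECT TXN_ID, AMOUNT, 1 FROM PSFT_ASTXACT_STAGE;   -- PeopleSoft 9.2

    INSERT INTO ASTXACT_FG (TXN_ID, AMOUNT, DATASOURCE_NUM_ID)
    SELECT TXN_ID, AMOUNT, 2 FROM EBS_ASTXACT_STAGE;    -- Oracle EBS 12.2

    -- 3. Finalize: create indexes and gather statistics.
    CREATE INDEX ASTXACT_FG_N1 ON ASTXACT_FG (TXN_ID);
    BEGIN
      DBMS_STATS.GATHER_TABLE_STATS(USER, 'ASTXACT_FG');
    END;
    /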

The generated load plan ensures that the table is truncated only at the required time, in this case only once, before any source system loads it. You can have separate load plans for each source, but they should not run in parallel.

Note:

Configuration Manager and ODI do not stop load plans from running in parallel; however, it is recommended that you not run them in parallel, for the following reasons:
  • All load plans truncate target tables upon the initial load. If you run load plans in parallel, one load plan can truncate the data loaded by the preceding load plan.

  • The mappings from SILOS onward are common and not based on the source system. If load plans run in parallel, you can have a situation in which only partial data is loaded from the second source because of the data that was loaded from the first source. To avoid this issue, make sure that the first load plan completes successfully before the second load plan runs.

    In other words, when you load a fact table, bear in mind that the fact table might join with multiple dimension and lookup tables to get the final data. When load plans run in parallel, some of the dimension, lookup, and staging tables can also contain data from a second source. This can lead to some lookups and dimensions not returning the appropriate values for the second source, because they haven't yet been completely loaded.

When extracting data from different source types (Siebel and PeopleSoft, for example) or from different versions of the same source type (E-Business Suite 12.0 and E-Business Suite 12.1, for example), no additional steps are required. However, extracting data from two or more instances of the same source version (such as two instances of E-Business Suite 12.0) is referred to as Multi-Instance Support and requires additional steps. For details, see My Oracle Support document 1633996.1, OBIA 11g: Multi-Instance Source Support in BI Applications 11.1.1.x.1.

About ETL Roles

Access to Configuration Manager and Functional Setup Manager (FSM) is controlled through these duty roles.

Oracle BI Applications has two duty roles for ETL operations:

  • Load Plan Operator Duty

  • Load Plan Administrator Duty

Oracle BI Applications has the following additional duty roles:

  • BI Applications Administrator Duty

  • BI Applications Functional Developer Duty

  • BI Applications Implementation Manager Duty

The security administrator must grant the appropriate duty roles to a user based on the user's job responsibilities. For information on the Configuration Manager and FSM screens that each duty role has access to, see Duty Roles for Access to Functional Setup Manager or Configuration Manager in Oracle Business Intelligence Applications Security Guide.

The BI Applications administrator, load plan operator, and load plan administrator users require appropriate access to ODI. In addition to these users being created in the LDAP system, these users must also be created in the ODI Repository, and they must be granted the Supervisor profile or an appropriate ODI profile. The BI Applications administrator must be granted the Supervisor role in ODI. Work with your security administrator to obtain the appropriate duty roles.

See Managing Security in Oracle Data Integrator in Oracle Fusion Middleware Developer's Guide for Oracle Data Integrator.

About Cloud Sources

Performing ETL from Fusion Cloud sources is a collaborative effort. Data extracted from cloud-hosted source application database tables using standard SDE knowledge module logic is uploaded automatically to Oracle Universal Content Management for uptake by on-premises ETL developers and operators.

At appropriate intervals, new or incremental datasets required to create and maintain the on-premises data warehouse are updated in an Oracle Source Dependent Schema (SDS), which serves as the on-premises copy of the cloud data for the purposes of aggregating, transforming, and loading it into the on-premises data warehouse. Once data is available in the SDS database schema, you can use Configuration Manager to create and run load plans that load the warehouse tables.

In the case of Taleo and RightNow data sources, data is sourced automatically and directly from the respective cloud transactional tables and loaded into the data warehouse staging tables during SDE processing.