Designing Data Integrator Projects

ETL Collaboration Overview

Once you have created an ETL collaboration using the Data Integrator Wizard, you can configure and customize the collaboration as needed to meet your data processing requirements. The following topics provide information about ETL collaborations and the ETL Collaboration Editor.

Execution Strategies

Sun Data Integrator automatically employs the optimal execution strategy for collaborations. The strategy it selects depends on the specific nature of the collaboration. If you do not want Data Integrator to determine the best execution strategy, you can configure a collaboration to use either the staging or the pipeline execution strategy, depending on what the collaboration is set up to do. For example, if the collaboration's business logic contains Java operators, you can only use the pipeline strategy. The following topics describe the criteria that Data Integrator uses to determine the best execution strategy; these are also the criteria to apply when you force an execution strategy.

Execution Strategy Selection

The execution strategy configuration for collaborations is set to the Best Fit option by default. When Data Integrator determines which execution strategy to use for a collaboration, it evaluates the collaboration for specific attributes, such as the tables and columns involved. The selected execution strategy also depends on whether the collaboration contains Java operators, which are operators that are not available across all supported databases. Examples of Java operators are date transformation operators and operators that parse business names and addresses or normalize names. Projects that use Java operators must be executed with the pipeline strategy.

You can force the execution strategy for a collaboration by changing its setting from Best Fit to Staging or Pipeline, as described in "Forcing Execution Strategies for Collaborations."

Data Integrator uses the following execution methods depending on a collaboration's attributes:

Direct/Simple Execution Strategy

With the direct/simple execution strategy, all extraction, transformation, and loading happens in a single database. When the Best Fit option is enabled, Data Integrator uses this strategy under the following conditions:

One Pass Execution Strategy

With one pass execution, extraction and transformation occur in the source database. When the Best Fit option is enabled, Data Integrator uses this strategy under the following conditions:

Staging Execution Strategy

With the staging execution strategy, all source tables are extracted from source databases and staged in temporary tables in the target database. Joins and transformations happen in the target database. This setting is used automatically when the Best Fit option is enabled and the conditions below occur. You can also select this option manually to force its use, in which case this execution strategy is recommended under the following conditions:

Pipeline Execution Strategy

With the pipeline execution strategy, transformation and loading (indirectly to the target database table) occur in Data Integrator's internal database engine. This setting is used automatically when the Best Fit option is enabled and the conditions below occur. You can also select this option manually to force its use, in which case this execution strategy is recommended under the following conditions:

Whitespace Considerations

Sun Data Integrator handles whitespace differently depending on the execution strategy. The table below describes what happens when you join a flat file table and an Oracle table and the comparison column in the Oracle table contains whitespace.

Table 1–1 Execution Strategies for Flat File and Oracle Table Joins

Best Fit
  Uses the staging execution strategy because the source tables are from different databases. The results are the same as if Staging were selected.

Staging
  Data Integrator extracts source tables from source databases and stages the data in temporary tables in the target database. By default, whitespace is trimmed.

Pipeline
  Data Integrator uses an internal database engine instead of temporary tables, accessing data directly from the source tables rather than extracting it to temporary tables. To avoid whitespace causing the join condition to fail, add LTRIM/RTRIM operators to the Oracle table column (see the sketch after the note below). The results are then the same as with Staging or Best Fit.


Note –

This feature will be made obsolete in a future release.
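
For the pipeline case, the workaround amounts to trimming the Oracle column inside the join condition itself. The following minimal SQL sketch shows the idea; the table and column names (FF_CUSTOMER, ORA_CUSTOMER, LAST_NAME) are illustrative only and are not part of the product.

    -- Illustrative names only. Without trimming, 'SMITH' from the flat file does not
    -- match 'SMITH   ' stored in a padded CHAR column on the Oracle side.
    SELECT f.CUST_ID,
           o.ACCOUNT_NO
    FROM   FF_CUSTOMER f
    JOIN   ORA_CUSTOMER o
           ON f.LAST_NAME = RTRIM(LTRIM(o.LAST_NAME))  -- trim the Oracle column in the join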


Explicit and Implicit Joins

The join condition specified on source tables is an explicit join. The condition specified on target tables is an implicit join. The target condition is used differently in insert and update statements. For update statements, the condition from the target table is used to identify the proper rows to update and match the rows to the target. For insert statements, the condition from the target table is used to verify that no duplicate rows are inserted.
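
As a rough illustration of how the two kinds of conditions are used, the SQL sketch below assumes hypothetical CUSTOMER, ACCOUNT, and TARGET_CUSTOMER tables; the statements a collaboration actually generates will differ.

    -- Explicit join: the condition is specified on the source tables.
    SELECT s.CUST_ID, s.NAME, a.BALANCE
    FROM   CUSTOMER s
    JOIN   ACCOUNT  a ON a.CUST_ID = s.CUST_ID;

    -- Implicit join, update case: the target condition identifies the rows to update.
    UPDATE TARGET_CUSTOMER t
    SET    t.BALANCE = 100
    WHERE  t.CUST_ID = 42;

    -- Implicit join, insert case: the target condition guards against duplicate rows.
    INSERT INTO TARGET_CUSTOMER (CUST_ID, NAME, BALANCE)
    SELECT 42, 'SMITH', 100
    FROM   DUAL
    WHERE  NOT EXISTS (SELECT 1 FROM TARGET_CUSTOMER t2 WHERE t2.CUST_ID = 42);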

Runtime Properties

The Staging Table Name property is used for the staging execution strategy. When you use the staging strategy and specify a staging table name for each source table, the ETL collaboration does not create a temporary staging table for the source data. Instead, the table with the specified name in the default tablespace of the target database is used for staging.

When all the source tables in an ETL collaboration are configured with a valid table name in the Staging Table Name property, the ETL process does not create or delete any temporary tables at runtime. Also, the process does not modify or alter the target tables other than for updating records as per the ETL collaboration.

Be careful about changing the default settings. By default, the staging tables are dropped after each run. If you do not want the tables dropped, change the Drop Staging Table property to false. Also by default, the data in a temporary table is truncated before each run. If you do not want the data truncated, set the Truncate Staging Table property to false.
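
As a rough sketch of what the default behavior amounts to at the SQL level, assume a source table SOURCE_CUSTOMER and a pre-created staging table STG_CUSTOMER supplied through the Staging Table Name property; both names are hypothetical, and the statements the runtime actually issues may differ.

    -- Default behavior: the named staging table is truncated before each run.
    TRUNCATE TABLE STG_CUSTOMER;

    -- Extraction copies the source rows into the named staging table in the target
    -- database (shown as a single INSERT ... SELECT purely for illustration; the
    -- real extraction crosses databases).
    INSERT INTO STG_CUSTOMER (CUST_ID, NAME, BALANCE)
    SELECT CUST_ID, NAME, BALANCE
    FROM   SOURCE_CUSTOMER;

    -- Transformation and loading then read from STG_CUSTOMER into the target table.
    -- Because the staging table is pre-created and named, no temporary table is
    -- created or dropped at runtime.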

Data Validation Conditions

Data Integrator provides operators to validate extracted data. You can validate multiple columns in a record through data validation conditions. If the validation fails for at least one of the columns, the record is rejected and is not loaded into the target tables. All windows that show conditions (for example, the Data Validation Condition window and the Extraction Condition window) provide the operators you need to model complex validation conditions. You can view rejected rows at design time. If a data validation condition is set, click Run Collaboration to see whether any records fail validation. If rejected rows exist, right-click the target table and select Show Rejected Data. The rejected data appears in the Output pane.
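
Conceptually, a data validation condition acts as a row-level filter on the extracted data. The SQL sketch below shows the idea with two hypothetical column checks; in the collaboration itself you build the condition graphically with the provided operators rather than writing SQL.

    -- Hypothetical validation condition: a row is loaded only if every column check passes.
    -- Rows that fail the WHERE clause are rejected and can be viewed with Show Rejected Data.
    INSERT INTO TARGET_CUSTOMER (CUST_ID, EMAIL)
    SELECT CUST_ID, EMAIL
    FROM   STG_CUSTOMER
    WHERE  LENGTH(CUST_ID) = 10
      AND  EMAIL LIKE '%@%';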

About the ETL Collaboration Editor

You use the ETL Collaboration Editor to create the business logic for ETL processes. The table below describes the ETL Collaboration Editor toolbar.

Table 1–2 ETL Collaboration Editor Toolbar

Source
  Changes the editor display to show the Java source code of the collaboration.

Design
  Changes the editor display to show a graphical representation of the collaboration.

Expand All Graph Objects
  Expands the tables and graphical elements displayed on the editor to show all mapping elements and fields. This is the default view.

Collapse All Graph Objects
  Collapses the tables and graphical elements displayed on the editor, making it easier to view the different components of the collaboration.

Toggle Output View
  Toggles between a full-screen pane and a divided pane that shows output messages, like log entries, validations, source and target table data, SQL code, and rejected rows.

Drop and Recreate Tables
  Drops all source and target database tables and then recreates them.

Refresh Metadata
  Refreshes the metadata for the source and target tables.

Select Source and Target Tables
  Enables you to select source and target tables to be used in the collaboration.

Create New Join
  Launches the Create New Join View dialog box, where you can define source table relationships, or joins. The Create New Join View dialog box also appears when you do either of the following:
  • Map a specific source table column to a target table that is already mapped to a different source table, and the two source tables are not already joined.
  • Connect a specific source table column to an operator (such as concatenate) that is already connected to a different source table, and the two source tables are not already joined.

Edit Database Properties
  Enables you to configure database OTDs to point the database URL to a different location for design time. This is a temporary setting for design time only; the setting is not saved with the OTD for runtime.

Add/Edit Runtime Inputs
  Allows you to add input variables to the collaboration. Input variables that are assigned by an external system, such as through a business process, are called runtime inputs.

Add/Edit Runtime Outputs
  Enables you to add runtime output variables to the collaboration.

Zoom In, Zoom Out, and Scale
  Changes the scale of the objects in the collaboration. You can zoom in and out, and you can specify a scale percentage.

AutoLayout All Graph Objects
  Automatically arranges all ETL Collaboration Editor window components.

Validate Collaboration
  Validates the mapping logic without executing the project.

Run Collaboration
  Executes the project and generates a message log that displays messages and errors if the execution fails.