Understanding IBM WebSphere DataStage

PeopleSoft has an original equipment manufacturer (OEM) agreement with IBM for its WebSphere DataStage ETL tool and bundles this offering with PeopleSoft EPM. The IBM WebSphere DataStage tool uses ETL jobs to target specific data from a PeopleSoft source database and migrate it to the OWS, OWE, and MDW tables. IBM WebSphere DataStage is comprised of a server tool and client tool, which are discussed in more detail below.

IBM WebSphere DataStage provides the following features:

Graphical design tools for designing ETL maps (called jobs)
Data extraction from a variety of data sources
Data aggregation using SQL SELECT statements
Data conversion using predefined or user-defined transformations and functions
Data loading using predefined or user-defined jobs

IBM WebSphere DataStage Terminology

You should be familiar with these IBM WebSphere DataStage terms:

Term	Definition
Administrators	Administrators maintain and configure DataStage projects.
Aggregator Stages	Aggregator stages compute totals or other functions of sets of data.
Data Elements	Data elements specify the type of data in a column and how the data is converted.
Container Stages	Container stages group reusable stages and links in a job design.
DataStage Package Installer	This tool enables you to install packaged DataStage jobs and plug-ins.
Hashed File	A hashed file groups one or more related files plus a file dictionary. DataStage creates hashed files when you run a job that creates hash files (these are delivered with PeopleSoft EPM). Hashed files are useful for storing data from tables from a remote database if they are queried frequently, for instance, as a lookup table.
Hashed File Stage	A hashed file stage extracts data from or loads data into a database containing hashed files. You can also use hashed file stages as lookups. PeopleSoft ETL jobs use hashed files as lookups.
Inter-process Stage	An inter-process stage allows you to run server jobs in parallel on a symmetric multiprocessing system.
Plug-in Stages	Plug-in stages perform processing that is not supported by the standard server job stage.
Sequential File Stage	A sequential file stage extracts data from or writes data to a text file.
Transform Function	A transform function takes one value and computes another value from it.
Transformer Stages	Transformer stages handle data, perform any conversions required, and pass data to another stage.
Job	A job is a collection of linked stages, data elements, and transforms that define how to extract, cleanse, transform, integrate, and load data into a target database. Jobs can either be server or mainframe jobs.
Job Sequence	Job sequence invokes and runs other jobs.
Join Stages	Join stages are mainframe processing stages or parallel job active stages that join two input sources.
Metadata	Metadata is data about data; for example, a table definition describing columns in which data is structured.

DataStage Server

The IBM WebSphere DataStage server enables you to schedule and run your ETL jobs:

Image: DataStage Sever

This diagram illustrates the DataStage Server

Three components comprise the DataStage server:

Repository

The Repository stores all the information required for building and running an ETL job.
DataStage Server

The DataStage Server runs jobs that extract, transform, and load data into the warehouse.
DataStage Package Installer

The DataStage Package Installer installs packaged jobs and plug-ins.

DataStage Client

The IBM WebSphere DataStage client enables you to administer projects, edit repository contents, and create, edit, schedule, run, and monitor ETL jobs.

Three components comprise the DataStage client:

DataStage Administrator
DataStage Designer
DataStage Director

DataStage Administrator

The DataStage Administrator enables you to:

Image: DataStage Administrator

This diagram illustrates the tasks that can be performed by the DataStage Administrator.

See Setting DataStage Server Properties and Setting Project Properties.

DataStage Designer

DataStage Designer enables you to:

Create, edit, and view objects in the metadata repository.
Create, edit, and view data elements, table definitions, transforms, and routines.
Import and export DataStage components, such as projects, jobs, and job components.
Create ETL jobs, job sequences, containers, routines, and job templates.
Create and use parameters within jobs.
Insert and link stages into jobs.
Set stage and job properties.
Load and save table definitions.
Save, compile, and run jobs.

See DataStage Designer Overview.

DataStage Director

DataStage Director enables you to:

Validate jobs.
Schedule jobs.
Run jobs.
Monitor jobs.
View log entries and job statistics.

See DataStage Director Overview.

Key DataStage Components

IBM WebSphere DataStage contains many different components that support the ETL process. Some of these components include stages, jobs, and parameters. Only the following key DataStage components are discussed in this topic:

DSX Files
Jobs
Hashed Files
Environmental Parameters
Shared Containers
Routines

A complete list of all DataStage components can be found in the WebSphere DataStage Development: Designer Client Guide.

DSX Files

PeopleSoft delivers a *.dsx file for each functional area within EPM. As part of your installation and configuration process you import the *.dsx file into a project that has been defined in your development environment. Included in the *.dsx file are various DataStage objects that define your project. The *.dsx files are organized by functional area and contain related ETL jobs.

Note that *.dsx files use the following naming convention:

<file_name>.dsx, for DataStage 8.1 customers.

For example, Common_Utilities.dsx.
<file_name>_85.dsx, for DataStage 8.5 customers.

For example, Common_Utilities_85.dsx.

To see a list of the PeopleSoft-delivered *.dsx files, refer to the file "DSX Files Import Description.xls" located in the following install CD directory path: <PSHOME>\SRC\ETL.

Each delivered *.dsx file contains the DataStage objects described in the following topics.

ETL Jobs

PeopleSoft delivers predefined ETL jobs for use with IBM WebSphere DataStage. ETL Jobs are a collection of linked stages, data elements, and transformations that define how to extract, transform, and load data into a target database. Stages are used to transform or aggregate data, and lookup information. More simply, ETL jobs extract data from source tables, process it, then write the data to target warehouse tables.

PeopleSoft deliver five types of jobs that perform different functions depending on the data being processed, and the warehouse layer in which it is being processed:

Load Stage	Type	Description
I	Source to OWS	Jobs in this category extract data from your PeopleSoft transaction system and populate target warehouse tables in the OWS layer of the warehouse. Source to OWS jobs assign a source system ID (SRC_SYS_ID) for the transaction system from which you are extracting data and populate the target OWS tables with that ID.
I	Source to MDW	Jobs in this category extract data from your transaction system and populate target dimension and fact tables in the MDW layer of the warehouse. The Online Marketing data mart is the only product to use this type of job.
II	OWS to OWE	Jobs in this category extract data from the OWS tables and populate target D00, F00, and base tables in the OWE layer of the warehouse. OWS to OWE jobs perform lookup validations for the target OWE tables to ensure there are no information gaps and maintain referential integrity. Many of the jobs aggregate your transaction data for the target F00 tables.
II	OWS to MDW	Jobs in this category extract data from the OWS tables and populate target DIM and FACT tables in the MDW layer of the warehouse. OWS to MDW jobs generate a surrogate key that helps facilitate dimension key resolution. The surrogate key value is used as the primary key in the target DIM table and as the foreign key in the FACT table. The jobs also perform lookup validations for the target DIM and FACT tables to ensure there are no information gaps and maintain referential integrity.
II	OWE to MDW	Jobs in this category extract data from the OWE tables and populate target DIM and FACT tables in the MDW layer of the warehouse. Properties of this job type mirror those of the OWS to MDW job. OWE to MDW jobs generate a surrogate key that helps facilitate dimension key resolution. The surrogate key value is used as the primary key in the target DIM table and as the foreign key in the FACT table. The jobs also perform lookup validations for the target DIM and FACT tables to ensure there are no information gaps and maintain referential integrity.

All job types identified in the table are incremental load jobs. Incremental load jobs identify and extract only new or changed source records and bring it into target warehouse tables.

See Understanding ETL Load Strategies in EPM.

ETL Jobs - Naming Convention

PeopleSoft use standard naming conventions for all ETL jobs; this ensures consistency across different projects. The following table provides the naming conventions for PeopleSoft delivered ETL jobs.

Object	Naming Convention	Example
Staging Server Job	J_Stage_[Staging Table Name]_[Source Release]_[EPM Release]	J_Stage_PS_AGING_TBL_FSCM91_EPM91
Sequencer Job	SEQ_[Staging Table Name]_[Source Release]_[EPM Release]	SEQ_J_Stage_PS_AGING_TBL_FSCM91_EPM91
CRC Initial Load Job	J_Hash_PS_[Staging Table Name]_[Source Release]_[EPM Release]	J_Hash_PS_AGING_TBL_FSCM91_EPM91
Common Lookup Load Job	J_Hash_PS_[Table Name]	J_Hash_PS_D_LRNG_ENV
MDW Dimension Job	J_Dim_PS_[Dimension Table Name]	J_Dim_PS_D_DEPT
MDW Fact Job	J_Fact_PS_[Fact Table Name]	J_Fact_PS_F_ENRLMT
OWE Dimension Job	J_D00_PS_[D00 Table Name without D00 Suffix]	J_D00_PS_ACCOMP_D00
OWE Fact Job	J_F00_PS_[F00 Table Name without F00 Suffix]	J_F00_PS_JOB_F00
OWE Base Job	J_BASE_PS_[Base OWE Table Name]	J_BASE_PS_XYZ

Hashed Files

Hash files are views of specific EPM warehouse tables and contain only a subset of the data available in the warehouse tables. These streamlined versions of warehouse tables are used to perform data validation (lookups) within an ETL job and select specific data from lookup tables (such as sourceID fields in dimensions).

In the validation (lookup) process the smaller hash file is accessed, rather than the base warehouse table, improving performance. The following diagram provides an example of a hash file lookup in a job.

Image: Lookup process using hash file

The following diagram provides an example of a hash file lookup in a job.

The following detailed view of an ETL job shows the Institution hashed file lookup in the Campus Solutions Warehouse J_Fact_PS_F_STU_RECRT job.

Image: Institution hashed file lookup in the J_Fact_PS_F_STU_RECRT job

This example illustrates the Institution hashed file lookup in the J_Fact_PS_F_STU_RECRT job.

A detailed view of the hashed file stage reveals the fields (including keys) the lookup uses to validate Institution records.

Image: Hashed file stage in the J_Fact_PS_F_STU_RECRT job

This example illustrates the Hashed file stage in the J_Fact_PS_F_STU_RECRT job.

Because hash files are vital to the lookup process, jobs cannot function properly until all hash files are created and populated with data. Before you run any job that requires a hash file, you must first run all jobs that create and load the hash files—also called initial hash file load jobs.

After hash files are created and populated by the initial hash file load jobs, they are updated on a regular basis by the delivered sequencer jobs. Hash files are updated in the same job as its related target warehouse table is updated. In other words, both the target warehouse table and the related hash file are updated in the same sequencer job. The successful load of the target warehouse table in the job triggers the load of the related hash file.

Image: Hash file update process

The following diagram provides an example of the this process.

See Understanding Data Validation and Error Handling in the ETL Process, Incremental Loading Using the Insert Flag and Lookup Validations.

Environmental Parameters

Environmental parameters are user-defined values that represent processing variables in your ETL jobs. Environmental parameters are reusable so they enable you to define a processing variable once and use it in several jobs. They also help standardize your jobs.

Though environmental parameters are reusable, PeopleSoft delivers specific environmental parameters for jobs related to each phase of data movement (such as the OWS to MDW jobs). Therefore, a single environmental parameter is not used across all ETL jobs, rather a subset of variables are used depending on the specific functionality of the job.

See Environmental Parameters Information.

Shared Containers

Shared containers are reusable job elements. A shared container is usually comprised of groups of stages and links, and is stored in the DataStage repository. You can use shared containers to make common job components available throughout your project. Because shared containers are reusable you can define them once and use them in any number of your ETL jobs. PeopleSoft delivers the following shared containers:

StoreLangStagingList
StoreMaxLastUpdDttm
StoreMaxRecordID
StorMaxSnapDate
StoreMaxValueDecimal
FactStoreMaxRecordID

Routines

Routines are a set of instructions, or logic, that perform a task within a job. For example, the ToInteger routine converts the input value to an integer. Because routines are reusable you can use them in any number of your ETL jobs.

See Routine Descriptions.

IBM Documentation

For more details on the IBM WebSphere DataStage tool and how to use it, refer to the IBM documentation listed below. You can install PDF versions of the IBM books as part of the IBM WebSphere tools install.

The following table lists the IBM documentation and the information provided.

IBM Book	Description
IBM Information Server: Planning Installation and Configuration Guide	Provides planning information and complete installation instructions for IBM Information Server. Also includes information about troubleshooting, validating the installation, and configuring the system.
IBM Information Server: Administration Guide	Describes how suite administrators can manage user access to components and features of IBM Information Server. In addition, describes how suite administrators can create and manage views of logged events and scheduled tasks for all components.
WebSphere DataStage Administration: Administrator Client Guide	Describes the WebSphere DataStage Administrator client and describes how to perform setup, routine housekeeping, and administration of the WebSphere DataStage engine.
WebSphere DataStage Administration: Deployment Guide	Describes how to package and deploy WebSphere DataStage jobs and associated objects to assist in moving projects from development to production.
WebSphere DataStage Administration: Director Client Guide	Describes the WebSphere DataStage Director client and explains how to validate, schedule, run, and monitor WebSphere DataStage parallel jobs and server jobs.
WebSphere DataStage Administration: National Language Support Guide	Describes how to use the national language support (NLS) features that are available in WebSphere DataStage when NLS is installed.
WebSphere DataStage Development: Designer Client Guide	Describes the WebSphere DataStage Designer client and gives a general description of how to create, design, and develop a WebSphere DataStage application
WebSphere DataStage Development: Server Job Developer Guide	Describes the tools that build a server job, and supplies programming reference information