Using DataStage Designer

This chapter provides an overview of DataStage Designer and discusses how to use it to manage repository objects, work with table definitions, build and compile DataStage jobs, and use expressions, constraints, hashed file lookups, and job sequencers.

Note. This chapter does not discuss all the features available in DataStage Designer. For a complete view of DataStage Designer functionality, see the delivered IBM WebSphere documentation.

DataStage Designer Overview

DataStage Designer is the primary interface to the metadata repository. It provides a graphical user interface that enables you to view, edit, and assemble the DataStage objects from the repository that are needed to create an ETL job.

An ETL job should include source and target stages. Additionally, your server job can include transformation stages for data filtering, data validation, data aggregation, data calculations, data splitting for multiple outputs, and the use of user-defined variables or parameters. These stages make the job design more flexible and reusable.

DataStage Designer enables you to:

DataStage Designer Window

The DataStage Designer window, which is the graphical user interface used to view, configure, and assemble DataStage objects, contains the following components:

The following diagrams show the layout of the DataStage Designer window components:

DataStage Designer Window - Layout View

The display area is in the right pane of the DataStage Designer window and displays the contents of a chosen object in the project tree.

By default, the Designer window contains the Repository window, Tool Palette, and Diagram window. You can optionally view the Property Browser by selecting View, Property Browser from the menu bar.

You can show or hide Designer windows and toolbars by selecting the appropriate option from the View menu. You can also dock, undock, or rearrange the Designer windows.

Designer Menus

Most Designer menu items are also available in the toolbars. The following are some additional options that are available through the menus:

View, Customize Palette: Customizes your palette.

View, Property Browser: Enables you to view and edit the properties of a DataStage object.

Import: Enables you to import ETL projects, jobs, or other components exported from another system, as well as DataStage components, such as table definitions, from text files or XML documents.

Export: Enables you to export DataStage objects as text files with the file extension .dsx.

Tools, Run Multiple Job Compile: Enables you to compile all of your jobs at the same time.

Tools, Run Director: Invokes the Director module and logs you into your project automatically.

Designer Toolbar

This table describes the Designer toolbar buttons:

New: Opens the New window, where you can create a new DataStage object.

New (arrow down button): Displays options associated with the New command on the toolbar.

Open: Displays the Open window, which enables you to open an existing or recently opened repository object.

Save: Saves the current job or container.

Save All: Saves all open jobs or containers.

Job Properties: Opens the Job Properties window for the job that is currently open in the Diagram window.

Cut: Cuts a specific object or text and temporarily stores it.

Copy: Copies a specific object or text and temporarily stores it.

Paste: Pastes the temporarily stored object or text.

Undo: Undoes the last task performed.

Redo: Redoes the last undone task.

Quick Find: Searches for DataStage objects using the quick find feature.

Advanced Find: Searches for DataStage objects using the advanced find feature.

Data Flow Analysis: Displays the data lineage for a column definition: where in the job design the column definition is used, the source of the data for the selected column or columns, and the target for that data.

Construct Local Container: Creates a local job container.

Construct Shared Container: Creates a shared container that other jobs can reuse.

Compile: Compiles the current job.

Run: Runs the current job.

Grid Lines: Shows or hides a grid in the Diagram window.

Link Markers: Shows or hides markers on the links.

Toggle Annotations: Shows or hides annotations in the Diagram window. You enter annotations by dragging the Annotation object from the palette.

Stage Validation Errors: Shows visual cues for parallel jobs or parallel shared containers. The cues display compilation errors for every stage on the canvas without your having to compile the job. This option is enabled by default.

Snap to Grid: When the grid is shown and Snap to Grid is enabled, aligns objects that you drag with the grid.

Zoom In: Magnifies the diagram display.

Zoom Out: Shrinks the diagram display.

Print: Prints the current Diagram window.

Generate Report: Generates an HTML report of a server, parallel, or mainframe job or shared container. You can view this report in a standard web browser.

Help on View: Displays context-sensitive help.

Debug Toolbar

The Debug toolbar provides basic functions for testing and troubleshooting your jobs.

You can access the Debug toolbar by selecting View, Debug. This table describes the Debug toolbar buttons:

Set Target Debug Job: Enables you to select the job that you want to debug.

Start/Continue Debugging: Starts or continues a job run in debug mode.

Next Link: Runs the job until the next link is reached.

Next Row: Runs the job until the next row is reached.

Stop Job: Stops the job run.

Set Debug Job Parameters: Sets job parameters.

Edit Breakpoints: Changes breakpoints (pauses that you have inserted into the run).

Toggle Breakpoints: Enables or disables breakpoints.

Clear All Breakpoints: Clears all breakpoints.

View Job Log in Director: Opens the job log in the Director module.

Show/Hide Debug Window: Displays or hides the debug window.

All of the Debug toolbar options are also available from the Debug menu.

Managing Repository Objects

You can use DataStage Designer to view job categories, which serve to organize repository objects.

You can view the following repository objects within a job category:

You can also create new repository objects:

You can also copy, rename, edit, delete, or move an item using the File menu commands or the item level shortcut menu.

Editing Object Properties

Object properties consist of descriptive information and other types of information, depending on the object type.

Using DataStage Designer you can:

DataStage Designer - Object Properties

The following is an example of an object property for the String data element.

Editing Job and Job Sequence Properties

DataStage Designer enables you to:

DataStage Designer - Job/Job Sequence Properties

The following is an example of a server job property.

Editing Server Routines

You can create, edit, or view server routines using the Routine window. Argument names in built-in routines cannot be changed.

The following components are classified as routines:

Editing the Stage Type

The Stage Type category in the project tree contains all the stage types that you can use in your jobs. Properties of WebSphere DataStage's pre-built stages are read-only.

You can create or edit object properties for the following stage types:

DataStage Designer enables you to create and register plug-in stages to perform specific tasks that the built-in stages do not support. You need to register custom plug-in stages before you can use them. In addition, DataStage Designer enables you to create custom parallel stage types.

Specifying Table Definitions

DataStage Designer enables you to:

DataStage Designer - Table Definitions

Table definitions:

Importing and Exporting Repository Components

The DataStage Designer import and export facilities enable you to move jobs or other components between projects. You can also move projects, jobs, or components from one system to another. In addition, you can import components from text files or XML documents, and you can export to XML documents. XML documents are a convenient way to view descriptions of repository objects in a web browser.

Importing

The DataStage Designer import facility enables you to import:

You can use the Import facility to import table definitions from a variety of sources, including sequential files, ODBC data sources, and XML documents.

Exporting

The DataStage Designer export facility enables you to export:

When you export projects or components, by default they are stored in text files with the file extension .dsx. You can also export to XML files by selecting the appropriate check box in the Export window. You also have the option to append the exported items to an existing file.

Using Table Definitions

Table definitions are:

You need a table definition for each data source stage or data target stage you use in your job. You can import, create, or edit a table definition using DataStage Designer.

Creating Table Definitions

To create a new Table Definition, select New Table Definition from the Table Definition menu. The Table Definition window appears:

The Table Definition window has these tabs:

Importing Table Definitions

You can directly import a table definition from a source or target database. You can import table definitions from ODBC data sources, plug-in stages, UniVerse tables, hashed files, UniData files, or sequential files.

In the DataStage Designer Repository window, right-click on Table Definitions. Select Import.

You can select the type of table definition data source from the available options.

Building DataStage Jobs

DataStage provides several types of jobs, including server, parallel, and mainframe jobs.

DataStage Jobs

The following is an example of one of the delivered Campus Solutions Warehouse server jobs.

Perform the following steps to build a job:

  1. Define optional project-level environment variables in DataStage Administrator.

  2. Define optional environment parameters.

  3. Import or create table definitions, if they are not already available.

  4. Add stages and links to the job to indicate data flow.

  5. Edit source and target stages to designate data sources, table definitions, file names, and so on.

  6. Edit transformer and processing stages to perform various functions, such as filtering data, creating lookups, and using expressions.

  7. Save, compile, troubleshoot, and run the job.

Using Database and File Stages

Database stages represent data sources or data targets.

DataStage provides three types of stages:

Each stage has a set of predefined and editable properties.

Server Job Database Stages

The following are some of the delivered server job database stages:

Server Job File Stages

The delivered server job file stages are:

Dynamic Relational Stages

Dynamic Relational Stages (DRS):

PeopleSoft-delivered ETL jobs use the DRS stage for all database sources and targets. This stage is represented in the Database group as "Dynamic RDBMS." When you create jobs, it is advisable to use the DRS stage rather than a database-specific type such as DB2, because a DRS stage dynamically handles all PeopleSoft-supported database platforms.

The following example shows a DRS database stage in a delivered Campus Solutions Warehouse job.

A DRS database stage supports the following relational databases:

A DRS database stage also supports any generic ODBC interface.

Editing the DRS Stage

You edit the DRS properties using the DRS stage window.

  1. Double-click the DRS stage to open the DRS stage window.

  2. Edit the stage properties in the window's two main tabs: the Stage tab and the Output tab.

The Stage tab contains two tabs: the General tab and the NLS tab. In the General tab, you define the source database type, the database or connection name, and the user ID and password used for that connection. The delivered jobs typically use environment variables to define the values of these fields; if environment variables or job parameters are not used in the DRS stage, you define the actual values in these fields.

Entering Information in the Output Window

The Output tab contains General, Columns, Selection, and SQL tabs:

In the General tab, the table name that is listed is the source of the data that this stage uses.

The Columns tab enables you to select which columns of data you want to pass through to the next stage. When you click the Load button, the system queries the source table and populates the grid with all of the column names and properties. You can then delete rows that are not needed.

The Selection tab enables you to enter a Structured Query Language (SQL) WHERE clause that specifies conditions for fetching data from tables. Entering a WHERE clause in the Selection tab is optional.
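For example, a minimal WHERE clause entered in the Selection tab might restrict the extract to current, active rows. The column names and date here are illustrative, not taken from a delivered job:

    EFF_STATUS = 'A' AND EFFDT >= '2005-01-01'

The stage appends this clause to the SELECT statement that it generates, which you can verify in the SQL tab.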

The following shows the SQL tab of a DRS stage.

The SQL tab contains the SQL statement used for the current stage.

Generated: Shows the SQL SELECT statement that is automatically generated by this stage. It is read-only.

Before: Enter optional SQL statements that are executed before the stage processes job data rows. This tab does not appear in every plug-in.

After: Enter optional SQL statements that are executed after the stage processes job data rows. This tab does not appear in every plug-in.

Note. You can define SQL in a DRS Stage.
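As an illustration, the SQL tab entries for a stage might look like the following; the table and column names are hypothetical, not objects from a delivered job:

    -- Generated tab (read-only): assembled from the table name, the
    -- Columns grid, and the Selection tab WHERE clause.
    SELECT EMPLID, EFF_STATUS, EFFDT
    FROM PS_SAMPLE_TBL
    WHERE EFF_STATUS = 'A' AND EFFDT >= '2005-01-01'

    -- Before tab: an optional statement that runs before any rows are
    -- processed, for example clearing a staging table.
    DELETE FROM PS_SAMPLE_STG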

Processing Stages

DataStage Processing Stages:

Processing Stage Types

This table describes the different types of Processing Stages:

Transformer: Transformer stages perform transformations and conversions on extracted data.

Aggregator: Aggregator stages group data from a single input link and perform aggregation functions such as COUNT, SUM, AVERAGE, FIRST, LAST, MIN, and MAX.

FTP: FTP stages transfer files to other machines.

Link Collector: Link Collector stages collect partitioned data and piece it back together.

InterProcess: An InterProcess (IPC) stage is a passive stage that provides a communication channel between WebSphere DataStage processes running simultaneously in the same job. It enables you to design jobs that run on SMP systems with great performance benefits.

Pivot: Pivot, an active stage, maps sets of columns in an input table to a single column in an output table.

Sort: Sort stages enable you to perform sort operations.

Transformer Stages

Transformer stages enable you to:

The following is an example of a delivered Transformer stage (Trans_Assign_Values stage).

Creating Transformer Stages

You create a transformer stage by opening the Processing group in the palette, selecting the Transformer stage, and clicking in the Diagram window. After creating links to connect the transformer to a minimum of two other stages (the input and output stages), double-click the Transformer icon to open the Transformer window.

Each link connected to the transformer is represented by a box in the upper area of the Transformer window. Transformer stages can have any number of links, with a minimum of two, so there can be any number of boxes in the upper area of the window. Labeling your links appropriately makes it easier for you to work in the Transformer Stage window.

The lines that connect the links define how the data flows between them. When you first create a transformer, link it to other stages, and then open it for editing; initially, no lines connect the link boxes. You can create these connections manually by clicking and dragging from a particular column of one link to a column in another link, or by clicking the Column Auto-Match button on the toolbar.

Using the Transformer Stage Toolbar

This table describes the buttons on the Transformer Stage toolbar:

Stage Properties: Define stage inputs and outputs when you link the transformer with other stages, specify optional before-stage and after-stage subroutines, define stage variables, and define the order in which input and output links are processed when there is more than one input or output link.

Constraints: Enter a condition that filters incoming data, allowing only the rows that meet the constraint criteria to flow to the next stage.

Show All or Selected Relations: If you have more than two links in the transformer, you can select one link and click this button to hide all connection lines except those on the selected link. With only two links present, clicking this button hides or displays all connections.

Show/Hide Stage Variables: Shows or hides a box that displays local stage variables, which can be assigned values in expressions or used in expressions.

Cut, Copy, Paste, Find/Replace: These are standard Windows buttons.

Load Column Definition: Loads a table definition from the repository, or imports a new one from a database.

Save Column Definition: Saves a column definition in the repository so that it can be used in other stages and jobs.

Column Auto-Match: Automatically sets columns on an output link to be derived from matching columns on an input link. You can then edit individual output link columns where you want a different derivation.

Input Link Execution Order: Orders the reference links. The primary data link is always processed first.

Output Link Execution Order: Orders all output links.

Adding and Linking Stages

Stages represent inputs, outputs, and transformations within a job. Links join the stages together and show the flow of data within the job.

You add stages and links to a job by clicking the stage type or link in the palette and then clicking in the diagram window.

The following example shows a job that contains stages and links.

A stage typically has at least one input or one output. However, some stages can have multiple inputs and output to more than one stage.

Different types of jobs use different stage types; the stages available in DataStage Designer depend on the type of job that is currently open.

Adding Stages

To add a stage to a job, click a stage type in the palette, and click in the Diagram window.

The stages are located as follows:

If the link is red, then the link is broken. Start and end the drag motion in the center of each stage to ensure that you have linked the stages correctly.

Adding Links

To add a link between stages, you click the Link object in the General palette group, and then click and drag the cursor from one stage to another.

Another option is to right-click on one stage and drag the link to another stage.

By default, new links are assigned generic names. However, we recommend that you rename all of your links to reflect their purpose and avoid confusion when you are editing transformers and stage properties.

Compiling and Running Jobs

Before running a job, you must always save and compile it.

Compiling a Job

To compile a job, click the Compile button on the DataStage Designer toolbar. After compiling the job, the result appears in the display area. If the result of the compilation is Job successfully compiled with no errors, you can schedule or run the job. If an error is displayed, you can click the Show Error button to highlight the stage where the problem occurs. Ensure that you have specified all the input and output column definitions, directory paths, file names, and table names correctly.

Criteria Checked when Compiling Jobs

The link to the source data stage is called the primary link. All other input links are called reference links.

During compilation, the following criteria in the job design are checked:

Specifying Job Run Options

After you compile a job, it becomes executable. The executable version of the job is stored in your project along with your job design.

To run a job, click the Run button on the DataStage Designer toolbar. The Job Run Options window appears, where you can specify information about running a server job.

In the Parameters tab, you enter parameter values for the job. Job parameters are defined in the Job Properties window; you can create job-specific parameters or use environment variables defined in DataStage Administrator. When you run a job, the parameters required to run it are displayed in the Parameters tab of the Job Run Options window, along with any default values that you specified in the job properties.

When setting values for environment variables, you can specify the special values $PROJDEF, $ENV, or $UNSET: $PROJDEF uses the default value defined for the project in DataStage Administrator, $ENV uses the current value of the variable from the server environment, and $UNSET explicitly unsets the variable.

In the Limits tab, you specify any run time limits.

You can specify whether stages in the job should be limited in how many rows they process and whether runtime error warnings should be ignored.

You specify whether the job should generate operational metadata in the General tab.

You can also disable any message handlers specified for the job run in the General tab.

Editing Job Properties

The Job Properties window enables you to:

Editing Job Properties

To edit job properties, click the Job Properties button on the DataStage Designer toolbar.

The Job Properties window contains the following tabs:

General: Enter the name, category, description, version number, before-job and after-job subroutines, and their input values.

Parameters: Define parameters to represent processing variables. Operators can be prompted for values at run time.

Job Control: Set up a job control routine that uses BASIC functions to call and run other jobs from the current job, as shown in the sketch after this table. You can also set up job control by using the Sequence Editor in the Designer module.

Dependencies: Enter any dependencies that this job has on functions, routines, or other jobs.

Performance: Displays options for improving performance.
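For example, a minimal job control routine, entered in the Job Control tab, might run one other job and log the result. This is a sketch only; the job name J_Load_Sample and parameter p_BusinessUnit are illustrative, not delivered objects:

    * Attach the job, treating attach failures as fatal.
    hJob = DSAttachJob("J_Load_Sample", DSJ.ERRFATAL)
    * Set an illustrative job parameter before running.
    ErrCode = DSSetParam(hJob, "p_BusinessUnit", "US001")
    * Run the job and wait for it to finish.
    ErrCode = DSRunJob(hJob, DSJ.RUNNORMAL)
    ErrCode = DSWaitForJob(hJob)
    * Check the finishing status and write it to the log.
    Status = DSGetJobInfo(hJob, DSJ.JOBSTATUS)
    If Status = DSJS.RUNOK Then
       Call DSLogInfo("J_Load_Sample finished OK", "JobControl")
    End Else
       Call DSLogWarn("J_Load_Sample did not finish OK", "JobControl")
    End
    ErrCode = DSDetachJob(hJob)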

Using Expressions

Expressions define a value that is evaluated at run time.

Simple expressions can contain:

A complex expression can contain a combination of constants, variables, operators, functions, and other expressions.
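For example, a derivation that combines an input column, a built-in BASIC function, and If Then Else logic might look like this; the link and column names are illustrative:

    If Trim(DSLink3.EFF_STATUS) = "A" Then "Active" Else "Inactive"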

Accessing Expressions

You can access expressions by double-clicking a Transformer stage within a job and then double-clicking the Derivation cell for any column in a transformer link; the Expression Editor opens. You can type an expression directly into the editor, or build one from a menu by clicking the Suggest button on the right side.

Different menus are displayed depending on whether you right-click an input link, an output link, the stage variable table, or the links area background. The menus also depend on the type of job you are working on (server, parallel, or mainframe). The output link menu includes operations on derivations, the input link menu includes operations on key expressions, and the stage variable menu includes operations on stage variables.

To insert a function in your expression, click the Suggest button and select Function. The available functions are displayed by category; click the plus sign next to a function type to display the functions categorized under it.

Expression Editor Options

This table describes the menu options for the Expression Editor:

DS Macro: Insert a built-in DataStage BASIC macro.

DS Function: Insert a built-in DataStage BASIC function.

DS Constant: Insert a predefined constant.

DS Routine: Insert a routine from the repository.

DS Transform: Insert a transform from the repository.

Job Parameter: Insert a parameter.

Input Column: Insert an input column name.

Link Variables: Insert a link variable.

Stage Variables: Insert a stage variable.

System Variables: Insert a system variable.

String: Insert a string.

Function: Insert a function.

() Parentheses: Insert parentheses.

If Then Else: Insert If Then Else logic into the expression.

Creating Constraints

Constraints, like expressions, enable you to filter or limit data based on criteria that you enter.

You can define a constraint or specify a reject link either by selecting an output link and clicking the Edit constraints button on the toolbar, or by double-clicking the Constraint entry in the output link header.

You can specify a constraint for each output link from a Transformer stage. You can also specify a particular link as a reject link. Reject links output rows that have not been written to any other output links from the Transformer stage.

For example, using the expression InsertFlag="Y" as a constraint verifies whether the lookup stages returned a value: if no values were returned by the lookup stages, the InsertFlag field is set to N, and the row does not flow to that output link.
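A constraint is entered as a BASIC expression that must evaluate to true for a row to pass to that output link. A minimal sketch, assuming illustrative link names and an InsertFlag stage variable:

    InsertLink constraint:  InsertFlag = "Y"
    UpdateLink constraint:  InsertFlag <> "Y"

Rows that satisfy no constraint can be captured by designating a reject link.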

Using Hashed File Stages

Using hashed files improves job performance by enabling validation of incoming data rows without having to query a database each time a row is processed. These are called lookups. The hashed file can also be placed locally, eliminating time that would be spent accessing a remote server.

You can create hashed files to use as lookups in your jobs by running one of the delivered hash file jobs, or you can create a new job that creates a target hashed file. In many of the delivered PeopleSoft sequence jobs, the appropriate hashed file is refreshed as the last step following the load of the data table, which ensures synchronized updates to the data in the hashed file for use in future lookups.

Hashed file stages:

Accessing Hashed File Stages

To access a hashed file stage, double-click the hashed file stage in a job.

The Hashed File Stage window contains the following tabs:

Stage tab: Define whether an account name or a directory path accesses the hashed file.

Inputs or Outputs tab: If the stage has an input, the Inputs tab is available; if it has an output, the Outputs tab is available.

Inputs - General tab: Enter the hashed file name, select whether to create a new file if none exists, and select whether to clear the file before writing to it.

Inputs - Columns tab: Select which columns of data will be written to the file.

Creating Hashed File Lookups

Lookups are references that enable you to compare each incoming row of data to a list of valid values, and then accept or reject that row based on the validation result.

DataStage job stages can have two types of input links:

Stream links, represented by solid lines, can connect either active or passive stages. Reference links, shown as dotted lines, are only used by active stages. Their purpose is to provide information that may affect how the data is changed, but they do not supply the actual data to be changed.

Typically, hashed files are used as lookups because they are much quicker to access than querying a database. Hashed files used as lookups usually contain only one or two key columns against which incoming data can be validated.

Before you can create the lookup, you must first create a hashed file containing the values to be used as a reference. To add a lookup stage to a job, you select the hashed file stage from the File palette, enter the directory path and file name of the hashed file, and link the hashed file stage to a transformer stage.

You use a DRS stage as a lookup when your lookup requires the use of relational operators, such as >= and <=.

Using Job Sequencers

Job Sequencers enable you to:

Designing job sequencers is similar to designing server jobs. You create the job sequence in DataStage Designer, add activities from the palette, and then join or sequence activities together using links. You control the flow of the activity sequence using triggers.

Once you have defined a job sequence, you can schedule and run the job from DataStage Director.

Note. Job sequencers can also be run using DataStage Director.

Using DataStage BASIC

If you need to load data from a non-PeopleSoft source, you usually do not have to perform any programming tasks: you can use the delivered transforms and routines, using the delivered jobs and sequences as templates. For more complex jobs, you can use DataStage BASIC to:

PeopleSoft provides ETL jobs for loading data from PeopleSoft applications into EPM. Some of the PeopleSoft jobs use custom routines written in DataStage BASIC.
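As an illustration, a custom transform function created in the Routine window might look like the following. The routine name, argument, and logic are hypothetical rather than a delivered PeopleSoft routine; a transform function receives its arguments by name and returns its result by assigning a value to Ans:

    * StatusToFlag(InStatus): returns "Y" for an active status code.
    If Trim(UpCase(InStatus)) = "A" Then
       Ans = "Y"
    End Else
       Ans = "N"
    End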