Extracting Legacy Data Using the Data Integrator Wizard (Analyzing and Cleansing Data for Sun Master Index)

Analyzing and Cleansing Data for Sun Master Index

Extracting Legacy Data Using the Data Integrator Wizard

Note –

The Data Integrator Wizard was enhanced in Java CAPS 6 Update 1. If you are using Release 6, follow the instructions under Extracting Legacy Data Using the Generate Schema Tool. MySQL is only available in Release 6 Update 1.

The Data Integrator Wizard allows you to quickly create a staging database for the Data Profiler and Data Cleanser. Once you create the database, you only need to run the generated collaborations to extract the legacy data. Data Integrator generates the staging database based on the object structure defined for the master index, so the data is automatically presented in a format that the Data Cleanser and Data Profiler can read.

Perform the following tasks in the order given to use the Data Integrator Wizard to extract the legacy data:

Connecting to the Source Database

For this task, you might need to install the driver for your source database type. Once a driver is available, you can create a connection to the database that contains the data to be analyzed and cleansed.

To Connect to the Source Database

Before You Begin

Complete the steps under Configuring the Environment.

For Oracle and MySQL, copy the database driver from your database installation to appServer_Home/lib.

In the NetBeans Services window, expand Databases and then expand Drivers.

If you do not see the driver for your source database, do the following:
1. Right-click on Drivers and then select New Driver.
2. On the Driver dialog box, click Add, and then browse to the location of the driver file.
  
  The driver you use depends on the database vendor for your source database. If you are unsure of how to find the driver, refer to your database documentation.

Right-click the driver for your database, and then select Connect Using.

On the New Database Connection dialog box, fill in the following fields:
- Database URL - The URL to the source database. If you do not know the format for this, refer to your database documentation.
- User Name – The user login ID for the source database.
- Password – The password for the source database.

Click OK.

Under Databases in the Services window, expand the URL you entered above and view the tables in the database.

If there are no database tables present, the connection failed. Recheck the values you entered for the database connection.

Next Steps

Continue to Creating the Staging Database and ETL Collaboration.

Creating the Staging Database and ETL Collaboration

This procedure describes how to create the staging database using the automated wizard. Depending on the type of data source and the options you use, the wizard skips certain unnecessary steps.

To Create the Staging Database and ETL Collaboration

Before You Begin

Complete the steps under Connecting to the Source Database.

On the Select Type of ETL Loader on the New File Wizard, select Advanced Extract – Transform – Load (ETL).

Click Next.

The Select or Create Database window appears.

To select a staging database to use for external data sources (for this project only), do one of the following:
1. Select an existing database to use from the DB URL field.
2. Select Create and Use New Database, enter a name for a new database in the DB Name field, and then click Create Database. Select the new database in the DB URL field.
  
  Note –
  This database is required and is used for internal processing only.

Click Next.

The Choose Data Source window appears.

Do any of the following:
- If you do not have any file data sources, click Next and skip to step 15 (choosing JDBC data sources).
- To specify a file data source using a URL, enter the URL and click Add.
- To specify a file data source that is stored on your network, browse for and select a file containing source data in the Choose a File box, and then click Add.
- Repeat the above two steps until all file data sources are selected.

Click Next.

The Enter Table Details window appears, with the information for the first data file displayed.

If necessary, modify the table name, the type of data encoding, and the type of document that contains the source data.

Data Integrator automatically fills in these fields based on the information from the previous window, so the existing values should be correct.

Click Next.

If the data file is a spreadsheet, the Choose a Sheet window appears; otherwise, the Import Table MetaData window appears.

If the Choose a Sheet window appears, select the name of the sheet in the spreadsheet that contains the source data, and then click Next.

When the Import Table Metadata window appears, modify the information about the data file as needed.

Data Integrator automatically fills in this information, but you might need to customize it. For more information about the properties you can configure, see Virtual Database Table Metadata Options in Designing Data Integrator Projects.

Preview the information in the bottom portion of the window, and then click Next.

The Enter Column Properties window appears.

In the upper portion of the window, customize any of the column properties.

For more information about these properties, see Virtual Database Column Properties in Designing Data Integrator Projects.

Preview the information in the lower portion of the window, and then click Next.

Do one of the following:
1. If you selected multiple file data sources, the wizard returns to the Enter Table Details window with the attributes for a different file displayed. Repeat the above steps beginning with step 7.
2. If all the files you specified are configured, a dialog box appears confirming the database table creation. Click OK on the dialog box and continue to the next step.
  
  The Select JDBC Source Tables window appears.

If you specified file data sources, they are already listed under Selected Tables. Click Next if you have no JDBC data sources to specify, or do the following to specify a JDBC data source:
1. Under Available Connections, select the database that contains the source data.
2. If there are multiple schemas in the database, select the schema to use.
3. Under Schemas, select the tables that contain the source data and then click Select.
4. Click Next.
  
  If there are tables to join, the Select Source Tables for Join window appears; otherwise, the Generate Target Database window appears.

To define join conditions, do the following. If there are no join conditions, click Next and skip to step 17.
1. Under Available Tables, select the tables to join, and then click the right arrow to add them to the Selected Tables list.
2. In the Preview panel, click the drop-down menu at the top of the join box and select the type of join to use from one of the following options:
  - Inner – Use this if all tables to be joined contain the same column.
  - Left Outer – Use this if the results should always include the records from the left table in the join clause.
  - Right Outer – Use this if the results should always include the records from the right table in the join clause.
  - Full Outer – Use this if the results should always include the records from both the right and left tables in the join clause.
3. To specify columns to exclude from each joined table, click the Select Column tab in the Preview pane and deselect any columns to exclude.
4. Click Next.
  
  The Generate Target Database Master Index Model window appears.

To create the staging database, do the following:
1. Deselect the check box for Use Existing Database Target Tables.
2. In the Object Definition File field, browse to and select the object.xml file generated for the Master Index project.
  
  Note –
  This file is located in NetBeansProjects_Home/Project_Name/src/Configuration.
3. In the Target Database Folder field, select or enter the path where you want to store the database.
4. In the Target Database Name field, enter a name for the database.
5. Click Generate Database.

Click Next.

The Select JDBC Target Tables window appears. The target tables to load the extracted data into are already listed under Available Connections. It is not recommended you change these.

Click Next.

The Map Selected Collaboration Tables window appears.

To map source and target data, do the following:
1. To disable constraints on the target tables, select Disable Target Table Constraints.
2. Select the SQL statement type to use for the transfer. You can select insert, update, or both.
3. For each target table listed on the right, select one or more source tables from the list directly to the left of the target table. These are the source tables that will be mapped to the target in the collaboration.
  
  Note –
  If you do not specify a mapping here, the source tables do not appear in the ETL collaboration. You can add the source tables directly to the collaboration using the Select Source and Target Tables function. To select multiple source tables for one target, hold down the Control key while you select the required source tables. If you select multiple source tables for one target, the source tables are automatically joined.
4. Click Finish.
  
  The new ETL collaboration appears in the Projects window. If multiple collaboration are created, they are given the name you specified for the collaboration with a target table name appended.

Next Steps

Continue to Connecting to the Staging Database.

Connecting to the Staging Database

To connect to the staging database, you need to use the Virtual DB database driver. This requires entering information you entered for the database in step 17 of Creating the Staging Database and ETL Collaboration.

To Connect to the Staging Database

On the NetBeans Services window, expand Databases and then expand Drivers.

Right-click Virtual DB, and then select Connect Using.

On the New Database Connection dialog box, fill in the following fields:
- Database URL – The URL to the staging database in this format: jdbc:axiondb:DatabaseName:DatabaseLocation.
  
  DatabaseName is the name you gave the database in Creating the Staging Database; DatabaseLocation is the path to the staging database.
- User Name – The user login ID for the staging database.
- Password – The password for the staging database.
  
  The default user name and password for the staging database is “sa” for both.

Click OK.

Under Databases in the Services window, expand the URL you entered above and view the tables in the database.

If there are no database tables present, the connection failed. Recheck the values you entered for the database connection.

Next Steps

Continue to Extracting the Data Using Data Integrator.

Extracting the Data Using Data Integrator

In this task, you verify the mapping from the source database to the staging database and then extract the data from source to staging. Before beginning this step, make sure your source database is running.

Note –

For more information about extracting and loading data using Data Integrator, see Designing Data Integrator Projects.

To Extract the Data Using Data Integrator

Before You Begin

Complete the instructions under Connecting to the Source Database.

In the NetBeans Projects window, open one of the ETL collaborations that was generated from the Data Integrator Wizard.

Tip –
If one of the tables contains a field that is a foreign key to other tables, you should process that table first.

Verify the mapping of the data fields to be extracted from the source tables to the appropriate fields in the target tables.

Caution –
A unique value must be mapped to the ObjectID column in each target table (where Object is the name of the object for which the table stores data). You can concatenate the source system and local ID of each incoming record to create a unique value, or you can define a custom function to populate the unique values.

To delete a mapping, highlight the mapping line and press the Delete key.

To map a field, click the arrow next to the source field name and drag it to the target field name.

To configure the target tables, do the following for each target table:
1. Select the target table and open its Properties panel (select Properties in the Window menu).
2. Deselect Create Target Table.
3. (Optional) Select Truncate Target Table.

Save your work, and then click Run Collaboration on the ETL Collaboration Editor toolbar.

Tip –
If any errors occur, the error message appears in the output panel at the bottom of the window. To fix the collaboration, select a target table, open its Properties panel, and delete the value of the Target Join Condition property. Repeat this step for each target table. Save and rerun the collaboration.

After the extraction process is complete, verify the data in the target database table by right-clicking each table on the canvas and then selecting Show Data.

The data in the tables appears in the output panel.

Next Steps

Continue to Determining the Fields to Analyze.