3 Files

This chapter describes how to work with Files in Oracle Data Integrator.

This chapter includes the following sections:

  • Section 3.1, "Introduction"

  • Section 3.2, "Installation and Configuration"

  • Section 3.3, "Setting up the Topology"

  • Section 3.4, "Setting Up an Integration Project"

  • Section 3.5, "Creating and Reverse-Engineering a File Model"

  • Section 3.6, "Designing a Mapping"

3.1 Introduction

Oracle Data Integrator supports fixed or delimited files containing ASCII or EBCDIC data.

3.1.1 Concepts

The File technology concepts map the Oracle Data Integrator concepts as follows:

  • A File server corresponds to an Oracle Data Integrator data server.

  • In this File server, a directory containing files corresponds to a physical schema.

  • A group of flat files within a directory corresponds to an Oracle Data Integrator model, in which each file corresponds to a datastore.

  • The fields in the files correspond to the datastore columns.

Oracle Data Integrator provides a built-in driver for Files and knowledge modules for integrating Files using this driver, using the metadata declared in the File data model and in the topology.

Most technologies also have specific features for interacting with flat files, such as database loaders, utilities, and external tables. Oracle Data Integrator can also benefit from these features by using technology-specific Knowledge Modules. In terms of performance, it is usually recommended to use database utilities when handling flat files.

Note that the File technology concerns flat files (fixed and delimited). XML files are covered in Chapter 5, "XML Files".

3.1.2 Knowledge Modules

Oracle Data Integrator provides the knowledge modules (KM) listed in this section for handling File data using the File driver.

Note that the SQL KMs listed in Table 3-1 are generic and can be used with any database technology. Technology-specific KMs, using features such as loaders or external tables, are listed in the corresponding technology chapter.

Table 3-1 Knowledge Modules to read from a File

  • LKM File to SQL: Loads data from an ASCII or EBCDIC file to any ANSI SQL-92 compliant database used as a staging area.

  • IKM SQL to File Append: Integrates data in a target file from any ANSI SQL-92 compliant staging area in replace mode.

  • IKM File to File (Java): Integrates data in a target file from a source file using Java processing. Can take several source files and generates a log file and a bad file. See Section 3.6.2.2, "IKM File to File (Java)" for more information.


3.2 Installation and Configuration

Make sure you have read the information in this section before you start working with the File technology:

  • Section 3.2.1, "System Requirements and Certifications"

  • Section 3.2.2, "Technology Specific Requirements"

  • Section 3.2.3, "Connectivity Requirements"

3.2.1 System Requirements and Certifications

Before performing any installation you should read the system requirements and certification documentation to ensure that your environment meets the minimum installation requirements for the products you are installing.

The list of supported platforms and versions is available on Oracle Technical Network (OTN):

http://www.oracle.com/technology/products/oracle-data-integrator/index.html.

3.2.2 Technology Specific Requirements

Some of the knowledge modules for File data use specific features of the database. This section lists the requirements related to these features.

Database Utilities

Most database technologies have their own utilities for interacting with flat files. All require that the database client software is accessible from the Agent that runs the mapping that is using the utility. Some examples are:

  • Oracle: SQL*Loader

  • Microsoft SQL Server: bcp

  • Teradata: FastLoad, MultiLoad, TPump, FastExport

You can benefit from these utilities in Oracle Data Integrator by using the technology-specific knowledge modules. See the technology-specific chapter in this guide for more information about the knowledge modules and the requirements for using the database utilities.

Requirements for IKM File to File (Java)

The IKM File to File (Java) generates, compiles, and runs a Java program to process the source files. In order to use this KM, a JDK is required.

3.2.3 Connectivity Requirements

This section lists the requirements for connecting to flat files.

JDBC Driver

Oracle Data Integrator includes a built-in driver for flat files. This driver is installed with Oracle Data Integrator and does not require additional configuration.

3.3 Setting up the Topology

Setting up the topology consists of:

  1. Creating a File Data Server

  2. Creating a File Physical Schema

3.3.1 Creating a File Data Server

A File data server is a container for a set of file folders (each file folder corresponding to a physical schema).

Oracle Data Integrator provides the default FILE_GENERIC data server, which suits most needs. In most cases, it is not required to create a File data server; you only need to create a physical schema under the FILE_GENERIC data server.

3.3.1.1 Creation of the Data Server

Create a data server for the File technology using the standard procedure, as described in "Creating a Data Server" of the Developer's Guide for Oracle Data Integrator. This section details only the fields required or specific for defining a File data server:

  1. In the Definition tab:

    • Name: Name of the data server that will appear in Oracle Data Integrator.

    • User/Password: These fields are not used for File data servers.

  2. In the JDBC tab, enter the following values:

    • JDBC Driver: com.sunopsis.jdbc.driver.file.FileDriver

    • JDBC URL: jdbc:snps:dbfile?<property=value>&<property=value>&...

      You can use the properties listed in Table 3-2 in the URL.

      Table 3-2 JDBC File Driver Properties

      • ENCODING=<encoding_code>: File encoding. The list of supported encodings is available at http://java.sun.com/j2se/1.4.2/docs/guide/intl/encoding.doc.html. The default encoding value is ISO8859_1.

      • TRUNC_FIXED_STRINGS=TRUE|FALSE: Truncates strings to the field size for fixed files. Default value is FALSE.

      • TRUNC_DEL_STRINGS=TRUE|FALSE: Truncates strings to the field size for delimited files. Default value is FALSE.

      • OPT=TRUE|FALSE: Optimizes file access on multiprocessor machines for better performance. Using this option on single processor machines may actually decrease performance. Default value is FALSE.


      JDBC URL example:

      jdbc:snps:dbfile?ENCODING=ISO8859_1&TRUNC_FIXED_STRINGS=FALSE&OPT=TRUE
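
      For illustration only, the following minimal Java sketch opens a connection through the built-in file driver using this URL format. It is a hypothetical standalone example (within Oracle Data Integrator, connections are defined in the topology, not in code) and assumes the driver shipped with Oracle Data Integrator is on the classpath:

      import java.sql.Connection;
      import java.sql.DriverManager;

      public class FileDriverConnectionSketch {
          public static void main(String[] args) throws Exception {
              // Register the Oracle Data Integrator built-in file driver
              Class.forName("com.sunopsis.jdbc.driver.file.FileDriver");

              // Properties as listed in Table 3-2; the values are illustrative
              String url = "jdbc:snps:dbfile?ENCODING=ISO8859_1&TRUNC_FIXED_STRINGS=FALSE&OPT=TRUE";

              // User and password are not used for File data servers
              try (Connection connection = DriverManager.getConnection(url)) {
                  System.out.println("Connected: " + !connection.isClosed());
              }
          }
      }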

3.3.2 Creating a File Physical Schema

Create a File physical schema using the standard procedure, as described in "Creating a physical schema" of the Developer's Guide for Oracle Data Integrator.

In your physical schema, you must set a pair of directories:

  • The Directory (Schema), where Oracle Data Integrator will look for the source and target files and create error files for invalid records detected in the source files.

  • A Directory (Work Schema), where Oracle Data Integrator may create temporary files associated with the sources and targets contained in the Data Schema.

Notes:

  • Data and Work schemas each correspond to a directory. This directory must be accessible to the component that will access the files. The directory can be an absolute path (m:/public/data/files) or relative to the runtime agent or Studio startup directory (../demo/files). It is strongly advised to use a path that is independent of the execution location.

  • In UNIX in particular, the agent must have read/write permission on both these directories.

  • Keep in mind that file paths are written differently on Windows and UNIX. Take the platform used by the agent into account when setting up this information.

For this physical schema, create a logical schema using the standard procedure, as described in "Creating a Logical Schema" of the Developer's Guide for Oracle Data Integrator, and associate it with a given context.

3.4 Setting Up an Integration Project

Setting up a project using the File technology follows the standard procedure. See "Creating an Integration Project" of the Developer's Guide for Oracle Data Integrator.

To get started, it is recommended to import the following knowledge modules into your project:

  • LKM File to SQL

  • IKM SQL to File Append

  • IKM File to File (Java)

In addition to these knowledge modules, you can also import file knowledge modules specific to the other technologies involved in your project.

3.5 Creating and Reverse-Engineering a File Model

This section contains the following topics:

  • Section 3.5.1, "Create a File Model"

  • Section 3.5.2, "Reverse-engineer a File Model"

3.5.1 Create a File Model

A File model is a set of datastores, corresponding to files stored in a directory. A model is always based on a logical schema. In a given context, the logical schema corresponds to one physical schema. The data schema of this physical schema is the directory containing all the files (possibly in sub-directories) described in the model.

Create a File model using the standard procedure, as described in "Creating a Model" of the Developer's Guide for Oracle Data Integrator.

3.5.2 Reverse-engineer a File Model

Oracle Data Integrator provides specific methods for reverse-engineering files. The File technology supports four types of reverse-engineering:

  • Section 3.5.2.1, "Delimited Files Reverse-Engineering"

  • Section 3.5.2.2, "Fixed Files Reverse-Engineering Using the Wizard"

  • Section 3.5.2.3, "COBOL Copybook Reverse-Engineering"

  • Section 3.5.2.4, "Customized Reverse-Engineering"

Note:

The built-in file driver uses metadata from the Oracle Data Integrator models (field data type or length, number of header rows, etc.). Driver-specific tags are generated by Oracle Data Integrator and passed to the driver along with regular SQL commands. These tags control how the driver reads or writes the file.

Similarly, when Oracle Data Integrator uses database loaders and utilities, it uses the model metadata to control these loaders and utilities.

It is important to pay close attention to the file definition after a reverse-engineering process, as discrepancies between the file definition and the file content are a source of issues at run-time.

3.5.2.1 Delimited Files Reverse-Engineering

To perform a delimited file reverse-engineering:

  1. In the Models accordion, right-click your File Model and select New Datastore. The Datastore Editor opens.

  2. In the Definition tab, enter the following fields:

    • Name: Name of this datastore

    • Resource Name: Sub-directory (if needed) and name of the file. You can browse for the file using the browse icon next to the field.

  3. Go to the Files tab to describe the type of file. Set the fields as follows:

    • File Format: Delimited

    • Heading (Number of Lines): Enter the number of lines of the header. Note that if there is a header, the first line of the header will be used by Oracle Data Integrator to name the columns in the file.

    • Select a Record Separator.

    • Select or enter the character used as a Field Separator.

    • Enter a Text Delimiter if your file uses one.

    • Enter a Decimal Separator if your file contains decimals.

  4. From the File main menu, select Save.

  5. In the Datastore Editor, go to the Attributes tab.

  6. In the editor toolbar, click Reverse Engineer.

  7. Verify the datatype and length for the reverse-engineered attributes. Oracle Data Integrator infers the field datatypes and lengths from the file content, but may set default values (for example, 50 for string field lengths) or incorrect datatypes in this process.

  8. From the File main menu, select Save.
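
For illustration, a hypothetical delimited file matching such a definition could look as follows, with one heading line (used to name the attributes), a comma as field separator, a double quotation mark as text delimiter, and a period as decimal separator:

    ORDER_ID,CUSTOMER_NAME,AMOUNT
    1,"Smith",125.50
    2,"Jones",89.99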

3.5.2.2 Fixed Files Reverse-Engineering Using the Wizard

Oracle Data Integrator provides a wizard to graphically define the columns of a fixed file.

To reverse-engineer a fixed file using the wizard:

  1. In the Models accordion, right-click your File Model and select New Datastore. The Datastore Editor opens.

  2. In the Definition Tab, enter the following fields:

    • Name: Name of this datastore

    • Resource Name: Sub-directory (if needed) and name of the file. You can browse for the file using the browse icon next to the field.

  3. Go to the Files tab to describe the type of file. Set the fields as follows:

    • File Format: Fixed

    • Heading (Number of Lines): Enter the number of lines of the header.

    • Select a Record Separator.

  4. From the File main menu, select Save.

  5. In the Datastore Editor, go to the Attributes tab.

  6. In the editor toolbar, click Reverse Engineer. The Attributes Setup Wizard is launched and displays the first records of your file.

  7. Click on the ruler (above the file contents) to create markers delimiting the attributes. You can right-click within the ruler to delete a marker.

  8. Attributes are created with pre-generated names (C1, C2, and so on). You can edit the attribute name by clicking in the attribute header line (below the ruler).

  9. In the properties panel (on the right), you can edit all the parameters of the selected attribute. You should set at least the Attribute Name, Datatype, and Length for each attribute.

  10. Click OK when the attributes definition is complete.

  11. From the File main menu, select Save.

3.5.2.3 COBOL Copybook Reverse-Engineering

COBOL Copybook reverse-engineering allows you to retrieve a legacy file structure from its description contained in a COBOL Copybook file.
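
For reference, a Copybook is a fragment of COBOL source code that describes a record layout. A hypothetical Copybook for a fixed order record could look like the following (the level numbers, field names, and pictures are illustrative):

    01  ORDER-REC.
        05  REC-CODE      PIC X(3).
        05  ORDER-ID      PIC 9(8).
        05  CUSTOMER-ID   PIC 9(8).
        05  ORDER-DATE    PIC X(10).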

To reverse-engineer a fixed file using a COBOL Copybook:

  1. In the Models accordion, right-click your File Model and select New Datastore. The Datastore Editor opens.

  2. In the Definition Tab, enter the following fields:

    • Name: Name of this datastore

    • Resource Name: Sub-directory (if needed) and name of the file. You can browse for the file using the browse icon next to the field.

  3. Go to the Files tab to describe the type of file. Set the fields as follows:

    • File Format: Fixed

    • Heading (Number of Lines): Enter the number of lines of the header.

    • Select a Record Separator.

  4. From the File main menu, select Save.

  5. In the Datastore Editor, go to the Attributes tab.

  6. In the toolbar menu, click Reverse Engineer COBOL CopyBook.

  7. In the Reverse Engineer COBOL CopyBook dialog, enter the following fields:

    • File: Location of the Copybook file

    • Character set: Copybook file charset.

    • Description format (EBCDIC | ASCII): Copybook file format

    • Data format (EBCDIC | ASCII): Data file format

  8. Click OK.

The attributes described in the Copybook are reverse-engineered and appear in the attributes list.

Note:

If a field has a datatype declared in the Copybook with no corresponding datatype in the Oracle Data Integrator File technology, this attribute will appear with no datatype.

3.5.2.4 Customized Reverse-Engineering

In this reverse-engineering method, Oracle Data Integrator reads from a Microsoft Excel spreadsheet containing column definitions of each file datastore within a model and creates the file datastores in batch.

A sample file called file_repository.xls is supplied with Oracle Data Integrator, typically in the /demo/excel sub-directory. Follow the format of this sample file to enter your datastore information.

The following steps assume that you have modified this file with the description of the structure of your flat files.

This file should be closed before the reverse-engineering is started.

To perform a customized reverse-engineering, perform the following steps:

  1. Create an ODBC datasource corresponding to the Excel spreadsheet containing the files description.

  2. Define the Data Server, Physical and Logical Schema for the Microsoft Excel Spreadsheet

  3. Run the customized reverse-engineering using the RKM File (FROM EXCEL).

Create an ODBC Datasource for the Excel Spreadsheet

  1. Launch the Microsoft ODBC Administrator.

    Note that ODI running on a 64-bit JRE will work with 64-bit ODBC only.

  2. Add a System DSN (Data Source Name).

  3. Select the Microsoft Excel Driver (*.xls, *.xlsx) as the data source driver.

  4. Name the data source ODI_EXCEL_FILE_REPO and select the file /demo/excel/file_repository.xls as the default workbook. Be sure to select the driver version that matches your file format, for example "Excel 12.0" for .xlsx files.

Define the Data Server, Physical and Logical Schema for the Microsoft Excel Spreadsheet

  1. In Topology Navigator, add a Microsoft Excel data server with the following parameters:

    • Name: EXCEL_FILE_REPOSITORY

    • JDBC Driver: sun.jdbc.odbc.JdbcOdbcDriver

    • JDBC URL: jdbc:odbc:ODI_EXCEL_FILE_REPO

    • Array Fetch Size: 0

  2. Use default values for the rest of the parameters. From the File main menu, select Save.

  3. Click Test Connection to see if the data server connects to the actual Excel file.

  4. Add a physical schema to this data server. Leave the default values in the Definition tab.

  5. In the Context tab of the physical schema, click Add.

  6. In the new line, select the context that will be used for reverse engineering and enter EXCEL_FILE_REPOSITORY in the logical schema column. This logical schema will be created automatically. Note that this name is mandatory.

  7. From the File main menu, select Save.

Run the customized reverse-engineering

  1. In Designer Navigator, import the RKM File (FROM EXCEL) Knowledge Module into your project.

    Note:

    If the EXCEL_FILE_REPOSITORY logical schema has not been created at the time of import, the customization status of the imported RKM will be "Modified by User". Once EXCEL_FILE_REPOSITORY is created, it will be visible as the source command schema under the corresponding RKM tasks.

  2. Open an existing File model (or create a new one). Define the parameters as you normally would for a File model. Note that the Technology is File, not Microsoft Excel.

  3. In the Reverse Engineer tab, set the following parameters:

    • Select Customized

    • Context: Reverse Context

    • Knowledge Module: RKM File (FROM EXCEL)

  4. In the toolbar menu, click Reverse Engineer.

  5. You can follow the reverse-engineering process in the execution log.

Note:

  • The mandatory Microsoft Excel schema, EXCEL_FILE_REPOSITORY, is automatically used by RKM File (FROM EXCEL). It is independent of the actual File model that uses RKM File (FROM EXCEL).

  • Refer to Section 8.7.2, "Common Problems and Solutions" for information on mitigating common Excel-related ODBC exceptions.

3.6 Designing a Mapping

You can use a file as a source or a target of a mapping, but NOT as a staging area.

The KM choice for a mapping or a check determines the capabilities and performance of this mapping or check. The recommendations below help in the selection of the KM for different situations concerning a File data server.

3.6.1 Loading Data From Files

Files can be used as a source of a mapping. The LKM choice in the Loading Knowledge Module tab, used to load a file to the staging area, is essential for mapping performance.

The LKM File to SQL uses the built-in file driver for loading data from a file to a staging area. In addition to this KM, you can also use KMs that are specific to the technology of the staging area or target. Such KMs support technology-specific optimizations and use methods such as loaders or external tables.

This knowledge module, as well as other KMs relying on the built-in driver, supports the following two features attached to the driver:

Erroneous Records Handling

The Oracle Data Integrator built-in driver provides error handling at column level for the File technology. When loading a file, Oracle Data Integrator performs several controls. One of them verifies that the data in the file is consistent with the datastore definition. If one value from a row is inconsistent with the column description, the On Error option (on the Control tab of the Attribute Editor) defines the action to perform for this value, and the driver continues to verify the remaining rows. The On Error option can take the following values:

  • Reject Error: The row containing the error is moved to a .BAD file, and the reason for the error is written to a .ERROR file.

    The .BAD and .ERROR files are located in the same directory as the file being read and are named after this file, with a .BAD and .ERROR extension.

  • Null if error (inactive trace): The row is kept in the flow and the erroneous value is replaced by null.

  • Null if error (active trace): The row is kept in the flow, the erroneous value is replaced by null, and the reason for the error is written to the .ERROR file.

Multi-Record Files Support

Oracle Data Integrator is able to handle files that contain multiple record formats. For example, a file may contain records representing orders (these records have 5 columns) and other records representing order lines (these records have 8 columns with different datatypes).

The approach in Oracle Data Integrator consists of considering each specific record format as a different datastore.

Example 3-1 Multi Record File

This example uses the multi record file orders.txt. It contains two different record types: orders and order lines.

Order records have the following format:

REC_CODE,ORDER_ID,CUSTOMER_ID,ORDER_DATE

Order line records have the following format:

REC_CODE,ORDER_ID,LINE_ID,PRODUCT_ID,QTY

Order records are identified by REC_CODE=ORD.

Order lines are identified by REC_CODE=LIN.
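
For illustration, a hypothetical orders.txt mixing both record types could contain the following lines:

    ORD,10,1500,2012-03-01
    LIN,10,1,P125,5
    LIN,10,2,P126,3
    ORD,11,1501,2012-03-02
    LIN,11,1,P127,1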

To handle multi record files as a source of a mapping:

  1. Create a File Model using a logical schema that points to the directory containing the source file.

  2. Identify the different record formats and structures of the flat file. In Example 3-1, two record formats can be identified: one for the orders and one for the order lines.

  3. For each record format identified, do the following:

    1. Create a datastore in the File Model for each type of record.

      For Example 3-1, create two datastores.

    2. In the Definition tab of the Datastore Editor, enter a unique name in the Name field and enter the flat file name in the Resource Name field. Note that the resource name is identical for all datastores of this model.

      For Example 3-1, you can use ORDERS and ORDER_LINES as the names of your datastores. Enter orders.txt in the Resource Name field for both datastores.

    3. In the Files tab, select Fixed or Delimited from the File Format list, depending on the format of your flat file, and specify the record and field separators.

    4. In the Attributes tab, enter the attribute definitions for this record type.

    5. One or more attributes can be used to identify the record type. The record code is the field value that distinguishes one record type from another in the file. The record code must be unique and allows files with several record patterns to be processed. In the Record Codes field, you can specify several values separated by the semicolon (;) character.

      In the Attribute Editor, assign a record code for each record type in the Record Codes field.

      In Example 3-1, enter ORD in the Record Codes field of the REC_CODE attribute of the ORDERS datastore and enter LIN in the Record Codes field of the REC_CODE attribute of the ORDER_LINES datastore.

With this definition, when reading data from the ORDERS datastore, the file driver will return only the records whose first attribute contains the value ORD. The same applies to the ORDER_LINES datastore (only the records whose first attribute contains the value LIN will be returned).
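
Conceptually, this record-code filtering behaves like the following Java sketch. It is an illustration only, not the driver's actual implementation, and it assumes a comma-delimited file:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.List;
    import java.util.stream.Collectors;

    public class RecordCodeFilterSketch {

        // Return only the records of a delimited file whose first field matches recordCode
        static List<String[]> read(String path, String recordCode) throws IOException {
            return Files.lines(Paths.get(path))
                    .map(line -> line.split(","))
                    .filter(fields -> fields.length > 0 && fields[0].equals(recordCode))
                    .collect(Collectors.toList());
        }

        public static void main(String[] args) throws IOException {
            // Reading the ORDERS datastore amounts to filtering on the ORD record code
            read("orders.txt", "ORD").forEach(r -> System.out.println(String.join("|", r)));
        }
    }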

3.6.2 Integrating Data in Files

Files can be used as a source and a target of a mapping. The data integration strategies for Files concern loading from the staging area to Files. The IKM choice in the Integration Knowledge Module tab determines the performance and possibilities of the integration.

Oracle Data Integrator provides two Integration Knowledge Modules for integrating File data:

3.6.2.1 IKM SQL to File Append

The IKM SQL to File Append uses the file driver for integrating data into a file target from a staging area, in truncate-insert mode.

This KM has the following options:

  • INSERT automatically attempts to insert data into the target datastore of the mapping.

  • CREATE_TARG_TABLE creates the target table.

  • TRUNCATE deletes the content of the target file and creates it if it does not exist.

  • GENERATE_HEADER creates the header row for a delimited file.

In addition to this KM, you can also use IKMs that are specific to the technology of the staging area. Such KMs support technology-specific optimizations and use methods such as loaders or external tables.

3.6.2.2 IKM File to File (Java)

The IKM File to File (Java) is the solution for handling File-to-File use cases. This IKM optimizes the integration performance by generating a Java program to process the files. It can process several source files when the datastore's resource name contains a wildcard. This program is able to run the transformations using several threads.

The IKM File to File (Java) provides two KM options for logging and error handling purposes: LOG_FILE and BAD_FILE.

This IKM supports flat delimited and fixed files where the fields can be optionally enclosed by double quotation marks ("). EBCDIC and XML formats are not supported.

Using the IKM File to File (Java)

To use the IKM File to File (Java), the staging area must be on a File data server. This is the default configuration when creating a new mapping: the staging area is located on the target, which is the File technology.

The IKM File to File (Java) supports mappings and filters. Mappings and filters are always executed on the source or on the staging area, never on the target. When defining the mapping expressions and filters, use Java syntax. Note that the mapping expressions and filter conditions must be written on a single line with no carriage return. The IKM supports the standard Java datatypes string, numeric, and date, and accepts any Java transformation on these datatypes.

The following are two examples of a mapping expression:

  • FIC.COL1.toLower()

  • FIC.COL1+FIC.COL2

In the second example, if COL1 and COL2 are numeric, the IKM computes the sum of both numbers; otherwise, it concatenates the two strings.

The following are two examples of a filter condition:

  • FIC.COL1.equals("ORDER")

  • (FIC.COL1==FIC.COL2)&&(FIC.COL3 !=None)
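
To make this syntax concrete, the following sketch shows per-row logic equivalent to the examples above. It is purely illustrative and is not the code actually generated by the KM; the Record class and its field names are assumptions:

    public class RowProcessingSketch {

        // FIC stands for the current source record; the field names are illustrative
        static class Record {
            String COL1;
            String COL2;
        }

        // Filter condition: keep only the rows where COL1 equals "ORDER"
        static boolean accept(Record FIC) {
            return FIC.COL1.equals("ORDER");
        }

        // Mapping expression: concatenates the two string columns
        // (with numeric columns, the same "+" would compute a sum)
        static String map(Record FIC) {
            return FIC.COL1 + FIC.COL2;
        }
    }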

The following objects and features are not supported:

  • Joins

  • Datasets

  • Changed Data Capture (CDC)

  • Flow Control

  • Lookups

Processing Several Files

The IKM File to File (Java) is able to process several source files. To specify several source files, use wildcards in the datastore's resource name (for example, orders_*.dat). You can use the PROCESSED_FILE_PREFIX and PROCESSED_FILE_SUFFIX KM options to manage the source files by renaming them once they are processed.

Using the Logging Features

Once the mapping is completed, Oracle Data Integrator generates the following output files according to the KM options:

  • Log file: This file contains information about the loading process, including the names of the source files, the target file, and the bad file, as well as a summary of the values set for the major KM options, error messages (if any), and statistics about the processed rows.

    Example 3-2 Log File

    Source File: /xxx/abc.dat
    Target File: /yyy/data/target_file.dat
    Bad File: /yyy/log/target_file.bad
    
    Header Number to skip: 1
    Errors allowed: 3
    Insert option: APPEND (could be REPLACE)
    Thread: 1
    
    ERROR LINE 100: FIELD COL1 IS NOT A DATE
    ERROR LINE 120: UNEXPECTED ERROR
    
    32056 Rows successfully read
    2000 Rows not loaded due to data filter
    2 Rows not loaded due to data errors
    
    30054 Rows successfully loaded
    
  • Bad file: This file logs each row that could not be processed. If no error occurs, the bad file is empty.

KM Options

This KM has the following options:

  • JAVA_HOME indicates the full path to the bin directory of your JDK. If this option is not set, the ODI Java Home will be used.

  • APPEND appends the transformed data to the target file if set to Yes. If set to No, the file is overwritten.

  • DISCARDMAX indicates the maximum number of records that will be discarded into the bad file. The mapping fails when the number of discarded records exceeds the number specified in this option.

    Note:

    Rollback is not supported. The records that have been inserted remain.

  • MAX_NB_THREADS indicates the number of parallel threads used to process the data.

  • LOG_FILE indicates the log file name. If this option is not set, the log file name will be automatically generated and the log file will be written in the target work schema.

  • BAD_FILE indicates the bad file name. If this option is not set, the bad file name will be automatically generated and the bad file will be written in the target work schema.

  • SOURCE_ENCODING indicates the charset encoding for the source files. Default is the machine's default encoding.

  • TARGET_ENCODING indicates the charset encoding for the target file. Default is the machine's default encoding.

  • REMOVE_TEMPORARY_OBJECTS removes the log and bad files if set to Yes.

  • PROCESSED_FILE_PREFIX indicates the prefix that will be added to the source file name after processing.

  • PROCESSED_FILE_SUFFIX indicates the suffix that will be added to the source file name after processing.