1.5 Data Stores

A Data Store is a connection to a store of data, whether the data is stored in a database or in one or more files. The data store may be used as the source of data for a process, or you may export the results of a process to a data store, or both.

It is normally recommended to connect to the data store via the server. When connecting to files, this means that the files must exist in the server landing area so that the server can access them. You can use File Download tasks or EDQ's SFTP interface to download files from external locations into the server landing area for processing, and File Upload tasks to transfer files from the landing area to external locations after processing. However, it is also possible to pull the data onto the server using a client connection. See Client-side Data Stores for more details.

Note:

You cannot add snapshots from, or exports to, client-side data stores to jobs, because the client connection used to import or export the data is not available to the server.

EDQ supports native connections to the following types of data stores:

Databases

  • Apache Hive

  • AWS Redshift

  • Cassandra

  • DB2

  • DB2 for i5/OS (see below for further details)

  • JNDI Datasources

  • Microsoft Access

  • Microsoft SQL Server

  • MySQL

  • Oracle

  • Oracle Autonomous Database (ADW or ATP)

  • Oracle Autonomous Database (ADW or ATP) with Uploaded Wallet

  • PostgreSQL

  • Snowflake

  • Sybase

  • Sybase Adaptive Server Enterprise

  • Sybase IQ

  • Sybase SQL Anywhere

  • Teradata Database

Data Files

  • Delimited Text Files

  • Delimited Text File Directory

  • Fixed Width Text Files

  • Apache Avro Files

  • JSON Files

  • JSON Lines Files

XML Files

  • XML and Stylesheet
  • Simple XML Files

MS Office Files

  • Microsoft Access
  • Microsoft Excel

System Information

  • Cases
  • User lists
  • Event log

Applications

  • Oracle Service Cloud

Other

  • JDBC connection

  • DB files (*.jmp)

Notes:

  • The JNDI Datasource option is used to connect to a preconfigured connection on an application server, such as WebLogic.

  • When exporting data to Cassandra, EDQ does not allow you to create a new table in the Cassandra database. However, you can export to existing tables.

  • The Oracle Service Cloud data store is supported only on the server. After a Service Cloud data store is registered for the first time, it may take a few minutes for the list of available tables to appear in the New Snapshot wizard. The Oracle Service Cloud data store uses a driver that is suitable only for data extraction, so it is not possible to export data to this type of data store. Updates to data in applications can often be made using the Call External Web Service processor.

Connecting EDQ to Oracle Databases

By default, EDQ connects to an Oracle database using a direct JDBC connection. It is also possible to connect to Oracle via a JNDI connection configured on the application server, or to configure the EDQ server to use connections via connection strings specified in tnsnames.ora files or on an LDAP server.

These alternative methods may be useful if connecting to a logical schema on a RAC-enabled database, such as Oracle Exadata. For details of how to configure EDQ to enable connections to Oracle via tnsnames or LDAP, see the Administering Oracle Enterprise Data Quality guide in the EDQ Documentation library.
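For reference, the default direct connection described above corresponds to a standard Oracle thin JDBC connection. The following is a minimal sketch, independent of EDQ, for checking a host, port, service name, and credentials before configuring the data store; the URL and credentials are placeholders, and the Oracle JDBC driver (ojdbc jar) is assumed to be on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;

public class OracleConnectionCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder host, port, service name, and credentials - replace with your own.
        String url = "jdbc:oracle:thin:@//db.example.com:1521/ORCLPDB1";
        try (Connection conn = DriverManager.getConnection(url, "edq_user", "password")) {
            // A successful connection prints the database version.
            System.out.println("Connected: " + conn.getMetaData().getDatabaseProductVersion());
        }
    }
}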

Connecting EDQ to DB2 for i5/OS

To connect EDQ to a DB2 for i5/OS database, download the DB2 for iSeries driver.

Use the following procedure:

  1. Download Toolbox for Java/JTOpen from http://jt400.sourceforge.net/

  2. Extract the downloaded file. The file name is in the format jtopen_N_n.zip, where N and n represent the first and second digits of the version number.

  3. Create a folder called db2i5 in the oedq_local_home/dbconnectors directory.

  4. Copy the jt400.jar file from the zip file extracted in step 2 to the db2i5 folder.

  5. Restart the EDQ (Datanomic) application server.

Connecting EDQ to Teradata Database

To connect EDQ to a Teradata database, download the JDBC driver jars.

Use the following procedure:

  1. Download the driver jars terajdbc4.jar and tdgssconfig.jar from Teradata.

    Note:

    tdgssconfig.jar is not part of the driver in Teradata Data Connector version 16.20.00.11 and later. It is still required for earlier versions.
  2. Create a directory named teradata in the oedq_local_home/dbconnectors directory.
  3. Copy the jars to this new directory.
  4. Restart the EDQ application server.

Text Files

  • Text Files (that is, CSV files or other delimited text files)

  • A directory of delimited text files

  • Fixed Width Text Files

  • JSON Files

XML Files

  • XML with Stylesheet (that is, XML files with an XSLT stylesheet file which defines how to translate the XML file into a format understood by EDQ)

  • Simple XML Files (that is, XML files with a simple 2-level structure where the top level tag represents the entity, and the lower level tags represent the attributes of the entity. XML files exported from Microsoft Access are an example.)

Connecting to an Access Database

EDQ uses a pure Java solution to connect to Access databases, so this connectivity is available on all platforms. Advanced features such as SQL entry or custom WHERE clauses are not supported.

Connecting EDQ to an Excel Data Store

You can connect to an Excel data store through the server or the client.

To connect EDQ to an Excel data store, configure the following connection details:

  • File in server work area - Specify the file in the server landing area that you wish to read or write. The landing area is the default location on the server for files that EDQ processes.

    Note:

    This option is available only for server-side connection.

    For client-side connections, browse for the file that you wish to read or write on your local computer. The selected file path is displayed in the File text box. See Client-side Data Stores for more details.

  • Use project specific landing area - Select this check box if you wish to use a project-specific landing area on the server for your processed files, instead of the default location above.

    Note:

    This option is available only for server-side connection.
  • Always overwrite file (stream data) on export - Select this check box if you always want to overwrite the Excel file (stream data) on export. Streaming mode uses significantly less memory when writing large XLSX files, but does not preserve worksheets and does not support append mode.
  • To set attribute types in an Excel spreadsheet, click Configure Worksheets. You can set the following attribute types:
    • String
    • Number
    • Date
    • Select the 1st row is header check box if you wish to use the first row of the selected worksheet as the header that creates the column names in EDQ.

    After setting the required attributes, click OK.

  • Click Test to check the connection to the new Excel data store.
  • When the test succeeds, click OK.

The new Excel data store is now configured and visible in the Project Browser. Alternatively, if the data store is going to be shared across projects, you can create it at the System level (outside of any specific project) in the same way as above.

Other

EDQ supports connectivity to other data stores using standards such as JDBC.

See Oracle's JDBC FAQ for information to help you connect to a data store using a JDBC URL.
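Independent of EDQ, a JDBC URL and its credentials can be verified with a short standalone program before configuring the data store. The following is a minimal sketch, assuming the vendor's JDBC driver jar is on the classpath; the driver class, URL, and credentials passed as arguments are placeholders.

public class JdbcUrlCheck {
    public static void main(String[] args) throws Exception {
        // args: <driver-class> <jdbc-url> <user> <password> - all placeholders.
        Class.forName(args[0]);  // load the vendor's driver class
        try (java.sql.Connection conn =
                 java.sql.DriverManager.getConnection(args[1], args[2], args[3])) {
            System.out.println("Connected to: " + conn.getMetaData().getDatabaseProductName());
        }
    }
}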

The connection details to a Data Store are saved on the EDQ server. Any user with access to that server, and the relevant project, can then use that data store.

Connecting to Oracle Autonomous Database (ADW or ATP)

Oracle Autonomous Database is a fully managed, preconfigured database environment with two workload types available, Autonomous Transaction Processing (ATP) and Autonomous Data Warehouse (ADW).

A connection to ADW or ATP requires a wallet file for additional security and a tnsnames.ora file containing the connection information. EDQ provides two options for creating a data store for ADW or ATP, which differ in the process used to provide the required files.

Oracle Autonomous Database (ADW or ATP)

When you select this type of data store, EDQ automatically downloads the wallet and tnsnames.ora files, using OCI REST APIs.

To configure a data store, select Oracle Autonomous Database (ADW or ATP) in the Server/Database category and enter the following information:

  1. OCID of Instance - The OCID of the database instance.

    For example,

    ocid1.autonomousdatabase.oc1.phx.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

    Note:

    You can copy the OCID for an autonomous database from the OCI Console.
  2. OCI Credentials - Stored credentials for a user who can connect to the OCI REST APIs to download the wallet. At a minimum, the user must have permission to generate the wallet and to read information about the database. These permissions can be granted by using the following OCI policy statements:
    Allow group MYGROUP to manage autonomous-databases in compartment MYCOMP where request.operation='GenerateAutonomousDatabaseWallet'
    Allow group MYGROUP to inspect autonomous-databases in compartment MYCOMP

    Replace the group name and compartment with the correct values for your tenancy.

    If you are running EDQ on an OCI compute instance, you can configure the instance itself with the required permissions. Follow the instructions in the OCI documentation topic Calling Services from an Instance, and select OCI Instance Principal Authentication as the credentials. If you are using instance authentication, you do not need to create stored credentials.

    Note:

    If you have provisioned your EDQ instance from Oracle Cloud Marketplace, you may have already set the above policy statements. These statements are required to discover and set up an autonomous database as the EDQ repository.
  3. Resource Level - The resource level (Low, Medium, High) used for connections to the database. For typical EDQ use cases, Oracle recommends the default setting, Low.
  4. DB User and DB Password - Database user credentials.
  5. Schema - Database schema. If you leave this field blank, the default schema for the user is used.
  6. Proxy Host and Proxy Port - If the EDQ server is running on a system that requires a proxy server to access OCI services, enter the proxy details in these fields. If EDQ is running on an OCI compute instance that cannot access the database directly, Oracle recommends setting up a Service Gateway as an alternative to a proxy. For more details, refer to Access to Oracle Services: Service Gateway.

Once you have completed the configuration, click Test to verify the connection.

    Note:

    If you have not entered proxy information when a proxy is required, the test may take a significant time before generating an error and in extreme cases may result in a client disconnect.

Oracle Autonomous Database (ADW or ATP) with Uploaded Wallet

To use this type of data store, you have to download the wallet information from the ADW or ATP administration console, and then upload the wallet and tnsnames.ora files to the EDQ landing area.

Perform the following steps to download the wallet files:

  1. Log in to the OCI console and navigate to the ADW or ATP instance that you are using.
  2. Click the Service Console button and select Administration.
  3. Click Download Client Credentials (Wallet).
  4. Enter and confirm a password, and click Download. Once the download is complete, you will not need the password again.
  5. Save the resulting ZIP file.

After downloading the wallet ZIP, expand it and upload the cwallet.sso and tnsnames.ora files to a folder in the EDQ landing area. You can use the SFTP server built into EDQ, or copy the files directly to the server. The other files in the ZIP are not required.

Note:

If EDQ is running on a server that requires a proxy to access the database, you will need to edit the tnsnames.ora file manually to specify the proxy before uploading it to the landing area. For instructions on configuring the proxy, refer to JDBC Thin Connections with an HTTP Proxy.

To configure a data store, select Oracle Autonomous Database (ADW or ATP) with Uploaded Wallet in the Server/Database category and enter the following information:

  1. Service Name - The database service name, as listed in the tnsnames.ora file. This will be DBNAME_LEVEL, where DBNAME is the database name and LEVEL is the resource level. For typical EDQ use cases, Oracle recommends the Low level.
  2. Landing area folder containing tnsnames.ora - The folder in the landing area containing the uploaded cwallet.sso and tnsnames.ora files. If the files are in the top level of the landing area, leave this field blank.
  3. User name and Password - Database user credentials.
  4. Schema - Database schema. If you leave this field blank, EDQ uses the default schema for the user.
Once you have completed the configuration, click Test to verify the connection.

Note:

If you have not entered proxy information when a proxy is required, the test may take a significant time before generating an error and in extreme cases this may result in a client disconnect.

Connecting EDQ to AWS Redshift

AWS Redshift is a fast, simple, cost-effective data warehousing service. EDQ connects to AWS Redshift using a local database connector with an XML file and the standard JDBC driver that AWS makes available.

To enable AWS Redshift connectivity, use the following procedure:

  1. Download the JDBC 4.2 driver (the version that does not include the AWS SDK), as described at:

    https://docs.aws.amazon.com/redshift/latest/mgmt/configure-jdbc-connection.html

  2. Create a directory named "redshift" in the oedq.local.home/dbconnectors directory, and copy the jar file to this new directory. The Redshift connector will then be available.
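Once the connector is available, the Redshift connection details can also be checked outside EDQ with the downloaded driver. The following is a minimal sketch, assuming the Redshift JDBC driver jar is on the classpath; the cluster endpoint, port, database name, and credentials are placeholders.

import java.sql.Connection;
import java.sql.DriverManager;

public class RedshiftConnectionCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder cluster endpoint, port, database name, and credentials.
        String url = "jdbc:redshift://examplecluster.abc123xyz789.us-west-2"
                   + ".redshift.amazonaws.com:5439/dev";
        try (Connection conn = DriverManager.getConnection(url, "edq_user", "password")) {
            System.out.println("Connected: " + conn.getMetaData().getDatabaseProductVersion());
        }
    }
}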

Connecting EDQ to JSON Data Stores

EDQ can read data from an array of objects in a JSON file. The array can be at the top level of the file, or can be reached via an attribute path.

String, number, boolean (true or false) and null values are supported in objects. Nested objects and arrays are ignored. For a snapshot definition, column names are determined by finding the unique attribute names in the first 1000 (or fewer) JSON objects in the input array. For exports, EDQ date attributes are converted to ISO-8601 date time strings in the UTC time zone, such as "2001-10-17T23:00:00.000Z".

The JSON data store type is found in the Text Files category in the New Data Store wizard. Files may be read from the server landing area or from a local file on the client computer.

The following options are available when connecting EDQ to JSON data stores:

  • JSON path to array: If the object array is not at the top level of the file, you must enter the attributes required to reach the array. Attributes are separated by dots (.).

    For example, consider the following JSON file:

    { "item": 
       { "list": [
          { "a" : 1, "b": "string", "c": true},
          { "a" : 1, "b": "string", "c": true}
         ]
       }
    }
    

    In this case, the JSON path is item.list.

  • Return all values as strings: If this option is selected, all columns are defined as strings, irrespective of the values found in the initial 1000-object scan. This may be necessary if a JSON attribute can contain either numbers or strings. If the first 1000 objects contained only numeric values, EDQ would define a numeric column, and any non-numeric value for that attribute in a later object would be returned as null. Setting this option means that all values are treated as strings.

Connecting EDQ to JSON Lines Data Stores

JSON Lines, also known as newline-delimited JSON, is a variant of JSON where each line in a file is a single JSON array or object. JSON Lines files are often used for data transfer from or to cloud services such as Google BigQuery. For more information on the format, refer to the JSON Lines documentation.

When EDQ reads a JSON lines file, the structure is derived by scanning the first 1000 lines. If the lines contain JSON arrays, attributes are named Column1, Column2, etc. If the lines contain JSON objects, attribute names are copied from the objects. A file may not contain a mixture of arrays and objects. When exporting to a JSON Lines file, you must select either the array or the object format.

To configure an EDQ Data Store to read from or write to a JSON Lines file, select JSON Lines files from the Client or Server Data Files category and enter the following information:

  1. File in server work area - For server-side connections, enter the path of the file in the EDQ landing area.

    or

    File - For client-side connections, browse for the file on your local computer.

  2. Export records as - Select Arrays or Objects to define the format of each line when a JSON Lines file is written.
  3. Return all values as strings - If this option is selected, all columns are defined as strings, irrespective of the values found in the initial 1000-line scan. This may be necessary if a JSON attribute can contain either numbers or strings. If the first 1000 lines contained only numeric values, EDQ would define a numeric column, and any non-numeric value for that attribute in a later line would be returned as null. Setting this option means that all values are treated as strings.
  4. Use project specific landing area - As with other server-based file data stores, select this option if the file is located in the landing area directory specific to the current project.

Example line formats:

  • Export as arrays -
    [1, "Othello", "Shakespeare"]
    [2, "The Importance of Being Earnest", "Wilde"]
  • Export as objects -
    {"id": 1, "title": "Othello", "author": "Shakespeare"}
    {"id": 2, "title": "The Importance of Being Earnest", "author": "Wilde"}

Connecting EDQ to Apache Hive using Kerberos Authentication

  • EDQ Server Configuration -
    To connect EDQ to Apache Hive, you must configure the server running EDQ to support Kerberos. This requires a valid Kerberos configuration file containing the realm used with Hive. The default location of the configuration file is:
    For Linux: /etc/krb5.conf
    For Solaris: /etc/krb5/krb5.conf
    For Windows: \Windows\krb5.ini

    See Kerberos Requirements for more details.

  • Setting Up the Connection -

    To set up a connection from EDQ to Hive with Kerberos, configure the following connection details (a minimal standalone JDBC sketch using the same details appears after this section):

    1. Database Host - Address of the Hive server. For example, hive001.example.com.
    2. Port - Listening port of the Hive Server.
    3. User Name and Password - The user name and password of a Kerberos user (principal) that has been enabled for Hive connectivity.
    4. Kerberos Service - The Kerberos principal name associated with the Hive server. This is normally hive/hostname@REALM, and is defined as hive.server2.authentication.kerberos.principal in the Hive server hive-site.xml configuration file. Enter the Hive service name in the Kerberos Service field, for example, hive/hive001.example.com@EXAMPLE.COM. In this example, the Kerberos realm is assumed to be EXAMPLE.COM.
    5. Schema - Enter the schema to use to access data in the database.

      Note:

      You can leave this field blank to use the default schema for the user.
  • Troubleshooting -

    Common errors which may occur when configuring Hive connections with Kerberos are:

    1. Client not found in Kerberos database - The user name or Kerberos service name is incorrect. Check the correct value with the Kerberos/Hive administrator.
    2. Checksum failed - The password is incorrect.
    3. Pre-authentication information was invalid - The password is incorrect (this error occurs when Windows Active Directory is used).
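As referenced above, the same connection details can be checked outside EDQ with the standard Hive JDBC driver. The following is a minimal sketch, assuming a valid Kerberos ticket already exists (for example, obtained with kinit) and the Hive JDBC driver is on the classpath; the host, port, schema, and service principal are placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveKerberosCheck {
    public static void main(String[] args) throws Exception {
        // Load the Hive JDBC driver explicitly in case it does not auto-register.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Placeholder host, port, schema, and Kerberos service principal.
        String url = "jdbc:hive2://hive001.example.com:10000/default;"
                   + "principal=hive/hive001.example.com@EXAMPLE.COM";
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SHOW TABLES")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));  // list visible tables
            }
        }
    }
}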

Connecting EDQ to Apache Avro™ Data Stores

Apache Avro™ is a data serialization and storage format often used in conjunction with Big Data systems. Avro is a supported import/export format for Google BigQuery. Every Avro file contains an embedded schema describing the exact data shape, so data types never need to be inferred from the data, unlike formats such as CSV.

For more information on the format, refer to the Apache Avro documentation. The EDQ data store implementation is based on version 1.10.0 of the Apache Avro specification.
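To illustrate the embedded schema, the following is a minimal sketch using the Apache Avro Java library (an assumption, not part of EDQ) that prints the schema stored in a data file, followed by the first few records; the file name is a placeholder.

import java.io.File;

import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class AvroSchemaPeek {
    public static void main(String[] args) throws Exception {
        // Open the Avro data file and read its embedded schema - no external schema is needed.
        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<>(new File("items.avro"), new GenericDatumReader<GenericRecord>())) {
            System.out.println(reader.getSchema().toString(true));
            int shown = 0;
            for (GenericRecord record : reader) {
                if (shown++ >= 5) break;       // print only the first few records
                System.out.println(record);
            }
        }
    }
}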

EDQ supports compressed Avro files. For snapshots, EDQ supports the following codecs:

  • deflate
  • bzip2
  • snappy
  • xz
  • zstandard

For exports, the only supported codec is deflate.

Supported Data Types

The schema in Avro files used with the EDQ data store must define the data as a record type. The record field names map to attributes in EDQ snapshots and exports.

The supported field types are: null, boolean, int, long, float, double, string, enum and union.

  • boolean values map to numbers with value zero or one.
  • enum types map to strings.
  • union types are supported if each of the constituent types is supported and each type maps to the same EDQ type. For example, a union with string and long types is not supported.
  • long types with an associated logical type of timestamp-millis, timestamp-micros, local-timestamp-millis or local-timestamp-micros map to date values in EDQ.

In addition, bytes and fixed types are supported when they have an associated decimal logical type.

Exporting to Apache Avro™

The names allowed for Avro record field names are restricted to letters, digits and underscores. EDQ attribute names are mapped to Avro field names by replacing spaces with underscores and removing other invalid characters. If the mapping results in duplicated names, numeric suffixes are added to resolve the duplication.

For example, the EDQ names "increase %" and "increase $" are mapped to "increase_" and "increase_2".

EDQ has a single data type for numbers and a single type for dates, whereas Avro can use a number of different formats for each. The definition of an Avro data store allows you to select the type used for numbers (int, long, or double) and the type used for dates: Timestamps (long with an associated timestamp-micros logical type) or Strings (ISO-8601 format with UTC time zone). All values are stored as Avro union types with null and the raw type as options.

Alternatively, an Avro schema file may be specified to give more control over the output types. A schema is defined in a JSON file, conventionally with the extension .avsc. For example, a schema for a simple export of item data could be:

{
  "type": "record",
  "name": "items",
  "fields": [
    { "name": "id",          "type": "int" },
    { "name": "code",        "type": "string" },
    { "name": "description", "type": [ "null", "string" ] },
    { "name": "cost",        "type": "double" },
    { "name": "available",   "type": { "type": "long", "logicalType": "local-timestamp-micros" }},
    { "name": "taxexempt",   "type": "boolean" }
  ]
}

Any attribute which may contain NULL values must be defined as null or as a union with null as one of the possible types; see the description field in the example above. An error occurs if a NULL value is exported to a field which does not support nulls, or a non-NULL value is exported to a field which supports only nulls.

Any field which is defined in the schema but not part of the export must support nulls.

If an export to Avro is run in append mode, and the output file exists and is not empty, the schema is read from the file and any schema defined with the data store is ignored.

Defining an Apache Avro™ Data Store

To configure an EDQ Data Store to read from or write to an Avro file, select Apache Avro from the Client or Server Data Files category and enter the following information:

  1. File in server work area - For server-side connections, enter the path of the file within the EDQ landing area.

    or

    File - For client-side connections, browse for the file on your local computer.

  2. Schema file for export - If you wish to override the default schema EDQ uses for exports, enter the landing area path (server) or file name (client) of the Avro schema file to use.
  3. Export numbers as - Select the Avro type to be used when exporting numbers. EDQ ignores this selection if you define a schema file or when appending to an existing file.
  4. Export dates as - Select the format to be used when exporting dates. EDQ ignores this selection if you define a schema file or when appending to an existing file.
  5. Compress on export - Enable this option to export data compressed with the deflate codec.
  6. Use project specific landing area - As with other server-based file data stores, select this option if the file is located in the landing area directory specific to the current project.

Connecting EDQ to Snowflake

Note:

This feature is applicable for EDQ 12.2.1.4.2 and later releases.
  • Installing the driver -

    Download the latest version of the Snowflake JDBC driver. See https://docs.snowflake.com/en/user-guide/jdbc-download.html for details. After the download completes, copy the jar file to the directory dbconnectors/snowflake in the EDQ "local" configuration directory. Create this folder if it does not exist. Restart the EDQ server. The Snowflake data store type will be available for use.

  • Setting Up the Connection -

    To set up a connection from EDQ to Snowflake, configure the following connection details (a minimal standalone JDBC sketch using the same details appears after this list):

    1. Account - The Snowflake account (required).
    2. Database - The database name (required).
    3. Warehouse - The warehouse name.
    4. Schema - The database schema. Entering a schema value reduces the number of tables shown when you configure a snapshot.

      Note:

      You can leave this field blank to use the default schema for the user.
    5. User name - The user name of the Snowflake user.
    6. Password - The password of the Snowflake user.
    7. Proxy host and proxy port - If a proxy is required to access Snowflake servers, enter the details in these fields.

      Note:

      If a proxy is required but you do not fill in these fields, the Snowflake driver will wait indefinitely while trying to make a connection.
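As referenced above, the same account, database, warehouse, schema, and credentials can be checked outside EDQ with the Snowflake JDBC driver. The following is a minimal sketch, assuming the downloaded driver jar is on the classpath; all values shown are placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.util.Properties;

public class SnowflakeConnectionCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder account, database, warehouse, schema, and credentials.
        String url = "jdbc:snowflake://myaccount.snowflakecomputing.com/"
                   + "?db=MYDB&warehouse=MYWH&schema=PUBLIC";
        Properties props = new Properties();
        props.put("user", "edq_user");
        props.put("password", "secret");
        try (Connection conn = DriverManager.getConnection(url, props)) {
            System.out.println("Connected: " + conn.getMetaData().getDatabaseProductVersion());
        }
    }
}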