1.6 Reference Data

Reference Data is data that processors use in lookups when checking and improving working data. Examples of Reference Data include:

  • Lists of valid or invalid values, characters or patterns

  • Maps used to standardize words, replace characters, or generate patterns

Each set of Reference Data may be created, edited and managed in EDQ itself, or may come from an external source. For example, a file that is stored and updated on the internet may be downloaded, made into a snapshot and used as Reference Data, or you may choose to maintain your own database of Reference Data and perform lookups against this database.

Reference Data that is managed in EDQ may also be used in processes in the same way as Staged Data. It can be profiled, checked, transformed, matched and so on.

There are two aspects of Reference Data definition:

  • The data itself

  • A lookup definition, defining how to perform lookups onto that data; that is, which columns to use for lookups, and which columns (if any) to return
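The split between the data and its lookup definition can be sketched in a few lines of Python. This is only an illustration of the concept, not EDQ's implementation: the LookupDefinition class, the build_index function and the sample country rows are all hypothetical names and values introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LookupDefinition:
    """Hypothetical model: which columns to look up on, and which to return."""
    lookup_columns: tuple
    return_columns: tuple = ()  # may be empty: the lookup then only tests for presence


def build_index(rows, definition):
    """Index the rows by their lookup columns, keeping only the return columns."""
    index = {}
    for row in rows:
        key = tuple(row[c] for c in definition.lookup_columns)
        returned = tuple(row[c] for c in definition.return_columns)
        index.setdefault(key, []).append(returned)
    return index


# The data itself (illustrative values only).
rows = [
    {"code": "GB", "name": "United Kingdom", "region": "Europe"},
    {"code": "FR", "name": "France", "region": "Europe"},
    {"code": "JP", "name": "Japan", "region": "Asia"},
]

# One lookup definition onto that data: look up by code, return the name.
by_code = LookupDefinition(lookup_columns=("code",), return_columns=("name",))
index = build_index(rows, by_code)

print(index.get(("GB",)))  # the returned columns for code "GB"
print(index.get(("XX",)))  # None: no matching row
```

Because the definition is held separately from the rows, a second definition (for example, looking up by region) could be applied to the same data without copying it.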

When creating a set of Reference Data, the New Reference Data option will create both a set of data (to be managed in EDQ) and a default lookup definition for that data. The New Lookup option will create a lookup onto an existing set of data, which may be from one of three sources:

  • An existing set of Reference Data (where you want to use a different lookup definition to the default)

  • Staged Data (either a snapshot or a set of staged data written from a process)

  • External Data (using one of the configured server-side Data Store connections)

When using the Reference Data in a processor option, there is no difference between Lookups and Reference Data.

Reference Data Managed in EDQ

It is advisable to manage in EDQ any lists and maps of data that are used to validate values and patterns, that will normally be small enough to load into memory (see note below), and that you may need to create or update using results.

Note:

As a guide, any Reference Data set with fewer than 50,000 rows should be loadable into memory on an EDQ server with the recommended minimum of 1 GB RAM, and so will be marked as loadable when selecting Reference Data for use in processors. Reference Data sets that are larger than this will by default not be loadable into memory, but if more memory is available, an administrator can change the 50,000-row limit on the server.
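The guideline in this note amounts to a simple threshold check, sketched below. The function and constant names are hypothetical, and treating a set of exactly 50,000 rows as not loadable is an assumption, since the note only describes sets smaller and larger than the limit.

```python
DEFAULT_ROW_LIMIT = 50_000  # guideline figure from the note; configurable by an administrator


def is_loadable(row_count, row_limit=DEFAULT_ROW_LIMIT):
    """Return True if a set of this size would be marked as loadable into memory.

    Assumption: the boundary is exclusive, following "fewer than 50,000 rows".
    """
    return row_count < row_limit


print(is_loadable(12_000))                      # small list: loadable
print(is_loadable(120_000))                     # above the default limit
print(is_loadable(120_000, row_limit=200_000))  # after the limit has been raised
```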

For example, the following types of Reference Data would normally be managed in EDQ:

  • Lookup lists of valid and invalid values, patterns, and regular expressions, used to check data

  • Standardization maps, used to transform data

  • Character maps used to generate patterns

  • Date and number format lists used to recognize and convert dates and numbers
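To make the first three of these types concrete, here is a minimal Python sketch of a lookup list, a standardization map and a character map. All values and names are illustrative assumptions; in particular, the character map shown (A for upper-case letters, a for lower-case letters, N for digits) is a convention chosen for this example and is not necessarily the one shipped with EDQ.

```python
import string

# Lookup list of valid values, used to check data.
VALID_TITLES = {"Mr", "Mrs", "Ms", "Dr"}

# Standardization map, used to transform data.
STANDARDIZATION_MAP = {"Mister": "Mr", "MR": "Mr", "Doctor": "Dr"}

# Character map, used to generate patterns (assumed convention, see above).
CHARACTER_MAP = {}
CHARACTER_MAP.update({ch: "A" for ch in string.ascii_uppercase})
CHARACTER_MAP.update({ch: "a" for ch in string.ascii_lowercase})
CHARACTER_MAP.update({ch: "N" for ch in string.digits})


def is_valid(value):
    """Check a value against the lookup list."""
    return value in VALID_TITLES


def standardize(value):
    """Replace a value using the map; unmapped values pass through unchanged."""
    return STANDARDIZATION_MAP.get(value, value)


def generate_pattern(value):
    """Replace each character using the character map; unmapped characters are kept."""
    return "".join(CHARACTER_MAP.get(ch, ch) for ch in value)


print(is_valid(standardize("Mister")))  # True
print(generate_pattern("SW1A 2AA"))     # AANA NAA
```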

A starter pack of Reference Data is shipped with EDQ, though new Reference Data can be created and modified quickly and easily from your own data, using the Results Browser.

Reference Data Categories

When creating a Reference Data set that is managed by EDQ, you can optionally assign it a Category.

Categories are used to provide shorter lists of Reference Data sets when selecting Reference Data from processors, where the processor option requires a certain 'type' of Reference Data, such as a list of characters, patterns, or regular expressions.

The following categories are all used by processors in the Processor Library, and are therefore available for selection when creating a Reference Data set. If new processors are created and added to the Processor Library, these may add further categories which will also appear in the list.

Staged Data Lookups

Staged Data Lookups are lookups onto an existing set of staged data in the repository (either a Snapshot, or data that has been written from another process).

When setting up a Staged Data Lookup, you must choose which column or columns to use for the lookup, and which columns you want to return.

You may configure several different lookups onto the same data, using different lookup and return columns.

Staged Data Lookups appear under the Reference Data node in the Project Browser, but with a Staged Data icon to indicate that the lookup is onto Staged Data rather than editable Reference Data or External Data.

External Data Lookups

External Data Lookups are lookups onto data that is not staged and that you do not want to stage; for example, a large data set that exists outside EDQ and may be frequently updated.

An External Data Lookup is configured in the same way as a Staged Data Lookup, with selected columns used for the lookup, and selected columns returned. However, the external data set is not staged in the EDQ repository.
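The difference from a staged lookup can be pictured as querying the external source for each lookup rather than holding a copy of the data. The sketch below uses Python's built-in sqlite3 module with an in-memory database standing in for an external data store; the postcodes table, its columns and the external_lookup function are hypothetical and say nothing about how EDQ itself issues queries.

```python
import sqlite3

# Stand-in for an external data store (in a real setup this would live outside the tool).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE postcodes (postcode TEXT PRIMARY KEY, town TEXT, county TEXT)"
)
conn.executemany(
    "INSERT INTO postcodes VALUES (?, ?, ?)",
    [
        ("CB1 1AA", "Cambridge", "Cambridgeshire"),
        ("OX1 1AA", "Oxford", "Oxfordshire"),
    ],
)


def external_lookup(connection, postcode):
    """Look up on the postcode column and return (town, county), or None if absent.

    The data is queried in place on each call; nothing is copied or staged.
    """
    cursor = connection.execute(
        "SELECT town, county FROM postcodes WHERE postcode = ?", (postcode,)
    )
    return cursor.fetchone()


print(external_lookup(conn, "CB1 1AA"))  # ('Cambridge', 'Cambridgeshire')
print(external_lookup(conn, "ZZ9 9ZZ"))  # None
```

Querying in place means that updates to the external data are seen by the next lookup, at the cost of a round trip to the source for each one.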

You may configure several different lookups onto the same data, using different lookup and return columns.

External Data Lookups appear under the Reference Data node in the Project Browser, but with the Data Store icon to indicate that the lookup is onto External Data rather than editable Reference Data or Staged Data.

Reference Data Levels

Reference Data may exist at two different levels. System-level Reference Data is globally shared on a server, and may be used in many projects. Project-level Reference Data may only be used in the project where it is stored.