1.3.11 Transformation Processors

Transformation processors take one or more input attributes, transform them, and output the transformed values in new attributes.

It is important to understand that transformers in EDQ never change the input data directly. EDQ allows you to see the effects of any transformations you apply before deciding how to use the transformed data. You may choose to use the transformed data in preference to the original data, for example before writing data out from a data cleansing process.

The most common uses of transformation processors are to transform data before it is migrated to a new system, or to prepare it for further data quality analysis, such as auditing or matching. Transformation processors may therefore be used at any point in the process flow. You may decide, for example, to transform all text data to upper or lower case before performing any analysis, so that the analysis is always case-insensitive.

Often, the transformations that you need to apply to the data are discovered during profiling and auditing. EDQ therefore allows you to build transformation rules directly from the data itself. For example, you might find a set of records with an invalid value for an attribute. You can then create a Reference Data map directly from the data in order to replace the bad values with their corrected versions. You can then configure a Replace processor to use your new Reference Data map and create a new attribute with the bad values replaced.
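The replacement step described above can be sketched in a few lines of Python. This is only an illustrative model, not EDQ's API: the function name replace_with_map, the TITLE attribute, and the sample values are hypothetical, and passing unmatched values through unchanged is an assumption. It shows the two points made so far: the map pairs bad values with corrected ones, and the result is written to a new attribute while the input is left as it was.

```python
# Hypothetical model of a Replace transformation driven by a Reference Data map.
# Not EDQ's API: records and the map are plain Python dicts.

def replace_with_map(records, attribute, reference_map):
    """Return new records carrying an extra '<attribute>.Replaced' value.

    The original attribute is left untouched. Values that are not in the
    map are passed through unchanged (an assumption made for this sketch).
    """
    output_name = f"{attribute}.Replaced"
    result = []
    for record in records:
        new_record = dict(record)  # copy, so the input record is never modified
        value = record[attribute]
        new_record[output_name] = reference_map.get(value, value)
        result.append(new_record)
    return result


# Invalid values found during profiling, mapped to their corrected versions.
title_map = {"Mr.": "Mr", "MISTER": "Mr", "Mrs.": "Mrs"}

records = [{"TITLE": "Mr."}, {"TITLE": "Mrs"}, {"TITLE": "MISTER"}]
replaced = replace_with_map(records, "TITLE", title_map)
```

After the call, each record in replaced holds both TITLE and TITLE.Replaced, so the effect of the map can be reviewed before deciding which version to write out.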

The attributes that a transformation processor creates may be Derived or Added, depending on the processor. It is important to understand this difference, as it affects the way your data flows work.

Derived Attributes

Derived Attributes are created by transformers that process each input attribute separately and produce a new, transformed version of each one. Derived Attributes are always named in the default format [Input Attribute Name].[Transformation], for example, Forename.Upper.

Table 1-133 Derived Attributes

Processor          Creates Derived Attribute with default name
Upper Case         [Attribute Name].Upper
Trim Whitespace    [Attribute Name].Trimmed
Denoise            [Attribute Name].Denoise
Trim Characters    [Attribute Name].Substring
Replace            [Attribute Name].Replaced
Proper Case        [Attribute Name].Proper

When an attribute is transformed by a processor that adds a Derived attribute, the output attribute is named to reflect the transformation.

By default, downstream processors use the latest version of an attribute as their input. For instance, if you insert a Denoise processor between the Reader and the Upper Case processor, the NAME attribute used as the input for the Upper Case processor will be the NAME.Denoise version of the attribute, rather than the original NAME attribute.

A blue arrow indicates that the latest version of the attribute will be used, including all the transformations that the attribute has undergone.

This means that you do not necessarily have to get the order of processing right first time. Inserting an interim transformation before another can often be done without affecting any other processors.
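The following Python sketch models this behavior under simplifying assumptions. It is not EDQ's implementation: run_flow, denoise, and upper are hypothetical stand-ins, and the denoise rule used here (keep only letters, digits, and spaces) is chosen only for illustration. Each step reads whatever the latest version of the attribute is and writes a new named version, so inserting Denoise before Upper Case changes what Upper Case reads without any change to the Upper Case step itself.

```python
# Hypothetical model of "latest version" resolution; not EDQ's implementation.

def denoise(value):
    # Stand-in for Denoise: keep only letters, digits, and spaces.
    return "".join(ch for ch in value if ch.isalnum() or ch == " ")


def upper(value):
    return value.upper()


def run_flow(record, attribute, steps):
    """Apply (suffix, function) steps in order to one attribute.

    Each step reads the latest version of the attribute and writes a new
    defined version named '<attribute>.<suffix>'.
    """
    versions = {attribute: record[attribute]}  # defined versions, by name
    latest = attribute                         # the version a new step would read
    for suffix, func in steps:
        name = f"{attribute}.{suffix}"
        versions[name] = func(versions[latest])
        latest = name
    return versions, latest


record = {"NAME": "j. smith!"}

# Reader -> Upper Case: Upper Case reads the original NAME.
before, _ = run_flow(record, "NAME", [("Upper", upper)])

# Reader -> Denoise -> Upper Case: Upper Case now reads NAME.Denoise.
after, latest = run_flow(record, "NAME", [("Denoise", denoise), ("Upper", upper)])
```

In this model, before["NAME.Upper"] is built from the raw value, while after["NAME.Upper"] is built from the denoised value, and every earlier version remains available by name.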

Derived Attributes are displayed in the Results Browser next to the attributes that they were derived from, even if the Derived Attribute has been renamed from its default name format (for example, NAME.Upper is renamed to New_name).

Defined attributes are indicated by a filled green circle. These refer to specific versions of the attributes, such as NAME.Denoise, rather than the latest version of an attribute.

Note:

It is possible to select a defined attribute as the input for a downstream processor, rather than the latest version. In the processor configuration, under the blue arrow icon, you can expand each attribute to view the defined attributes that are available. In the example above, NAME (the original source attribute) and NAME.Denoise are available. Any of the listed attributes may be selected as an input for the processor.

Added Attributes

Added attributes are created by transformers where the new attribute is not directly related to a single input attribute, or if there is a change of data type. Added attributes are created in the following cases:

  • More than one input attribute is used in the transformation (for example, in a concatenation)

  • More than one output attribute is created from the same input attribute (for example, in a split)

  • The data type of the input attribute is changed (for example, in a data type conversion)

Added Attributes are assigned a default name according to the transformation operation. For example, Concat is used for a concatenation. Examples of processors that add Added Attributes include:

Table 1-134 Added Attributes

Processor                 Creates Added Attribute with default name
Concatenate               Concat
Make Array from Inputs    Array
Multiply                  MultipliedValue
Add                       AddedValue
Make Array from String    ArrayFromString
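The three cases that produce an Added attribute can be expressed as a small decision function. The Python below is a sketch only: output_kind and concatenate are hypothetical helpers, not part of EDQ, and the space delimiter is an assumption. The concatenate example shows why the result cannot be a Derived attribute: it combines two inputs, so it is not a new version of either one and takes the operation's default name instead.

```python
# Hypothetical helpers that restate the rules above; not part of EDQ.

def output_kind(inputs_combined, outputs_per_input, type_changed):
    """Classify a transformer's output as 'Derived' or 'Added'.

    inputs_combined:   input attributes merged into a single output
    outputs_per_input: output attributes produced from one input
    type_changed:      True if the output data type differs from the input
    """
    if inputs_combined > 1:      # for example, a concatenation
        return "Added"
    if outputs_per_input > 1:    # for example, a split
        return "Added"
    if type_changed:             # for example, a data type conversion
        return "Added"
    return "Derived"


def concatenate(record, inputs, delimiter=" ", output_name="Concat"):
    """Return a copy of the record with the inputs joined into one new attribute."""
    new_record = dict(record)
    new_record[output_name] = delimiter.join(str(record[name]) for name in inputs)
    return new_record


person = {"Forename": "Ann", "Surname": "Lee"}
joined = concatenate(person, ["Forename", "Surname"])
```

Here output_kind(1, 1, False) corresponds to a processor such as Upper Case, and output_kind(2, 1, False) to Concatenate.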

Output Attribute Naming

If you configure a processor that adds an attribute (either Derived or Added) whose name follows the format [Input Attribute].[Output], the output attribute or attributes created by the processor will be renamed if the input attributes are changed. This applies to all processors that add Derived attributes. It also applies to some processors that add Added Attributes, where the outputs are related to the input attribute or attributes but there is a reason not to add a Derived attribute. The usual reason is a change of data type: an Added rather than a Derived attribute has to be created, because otherwise the inputs to downstream processors could be invalidated.

The following processors add Added Attributes that are named in this way:

Table 1-135 Output Attribute Naming

Processor                   Creates Added Attribute with default name
Convert Number to String    [Input Attribute].NumberToString
Convert Date to String      [Input Attribute].DateToString
Convert String to Date      [Input Attribute].StringToDate
Convert String to Number    [Input Attribute].StringToNumber
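The renaming rule can be pictured as an output name that is computed from the current input rather than stored. The Python sketch below is a hypothetical model, not EDQ code: the class ConvertStringToNumber, its fields, and the PRICE and COST attribute names are invented for illustration. The custom_name field reflects one reading of the condition above, namely that an output renamed away from the default format no longer follows the input; treat that as an assumption.

```python
# Hypothetical model of default output naming; not EDQ code.

class ConvertStringToNumber:
    SUFFIX = "StringToNumber"

    def __init__(self, input_attribute):
        self.input_attribute = input_attribute
        self.custom_name = None  # set only if the user renames the output

    @property
    def output_name(self):
        # A default-format name is derived from the current input, so it
        # changes whenever the input attribute changes.
        if self.custom_name is not None:
            return self.custom_name  # assumption: custom names are kept as-is
        return f"{self.input_attribute}.{self.SUFFIX}"


processor = ConvertStringToNumber("PRICE")
name_before = processor.output_name

processor.input_attribute = "COST"  # reconfigure the processor's input
name_after = processor.output_name
```

With the default name in place, name_before is PRICE.StringToNumber and name_after is COST.StringToNumber.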