1.3.7.3 Data Types Profiler

The Data Types Profiler analyzes the content of one or more attributes to assess whether the values in each attribute conform to a consistent data type (text, number, or date).

Use the Data Types Profiler to understand the types of data found in each attribute, to assess whether each attribute's data type is consistent, and to find values whose data type may be incorrect, for example because data was entered in the wrong field or with the wrong data type constraint.

The Data Types Profiler looks for three basic types of data:

  • Dates, for any values that match, as a whole, one of a configurable list of date formats

  • Numbers, for any wholly numeric values (such as 12, 56.2, -0.087)

  • Text, for any other values, such as text strings or a mixture of text and numerals

Null values are counted separately from the above.
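
The classification described above can be sketched in Java as follows. This is an illustration only, not the processor's implementation: the class name DataTypeSketch, the three-entry format list (a stand-in for the Date Formats Reference Data), the numeric pattern, the treatment of empty strings as null, and the choice to test for numbers before dates are all assumptions made for the example.

```java
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

final class DataTypeSketch {
    enum Kind { NULL, DATE, NUMBER, TEXT }

    // Hypothetical stand-in for the Date Formats Reference Data.
    static final List<String> DATE_FORMATS =
            Arrays.asList("dd/MM/yyyy", "yyyy-MM-dd", "dd-MMM-yyyy");

    // Assumption: "wholly numeric" means optional minus sign, digits, optional decimal part.
    static final Pattern NUMERIC = Pattern.compile("-?\\d+(\\.\\d+)?");

    static Kind classify(String value) {
        if (value == null || value.isEmpty()) {
            return Kind.NULL; // assumption: empty strings are counted as null
        }
        if (NUMERIC.matcher(value).matches()) {
            return Kind.NUMBER; // assumption: the numeric test runs before the date test
        }
        for (String format : DATE_FORMATS) {
            if (matchesWholeValue(format, value)) {
                return Kind.DATE;
            }
        }
        return Kind.TEXT; // anything else, including mixed text and numerals
    }

    // True only when the strict parser consumes the entire value.
    static boolean matchesWholeValue(String format, String value) {
        SimpleDateFormat parser = new SimpleDateFormat(format, Locale.ENGLISH);
        parser.setLenient(false);
        ParsePosition position = new ParsePosition(0);
        return parser.parse(value, position) != null
                && position.getIndex() == value.length();
    }

    public static void main(String[] args) {
        String[] samples = {"12", "56.2", "-0.087", "31/01/2024", "Flat 3", ""};
        for (String sample : samples) {
            System.out.println("'" + sample + "' -> " + classify(sample));
        }
    }
}
```

Checking that the parse position reaches the end of the value is what makes this a whole-value match: a value that merely begins with a date falls through to Text.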

The following table describes the configuration options:

| Configuration | Description |
| --- | --- |
| Inputs | Specify any attributes that you want to analyze for data type consistency. |
| Options | Describes options you can specify. |
| List of recognized date formats | Recognizes dates in a variety of different formats. Specified as Reference Data (Date Formatting Category). Default value is *Date Formats (see Note). |
| Outputs | Describes any data attribute or flag attribute outputs. |
| Data Attributes | None. |
| Flags | None. |

The Date Formats Reference Data used by the Data Type Check must conform to the standard Java 1.6.0 or later SimpleDateFormat API.

To understand how to add Reference Data entries for the correct recognition of dates, see the online Java documentation at http://java.sun.com/j2se/1.5.0/docs/api/java/text/SimpleDateFormat.html.
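
Before adding a pattern to the Reference Data, it can help to check it against SimpleDateFormat directly. The sketch below is a hypothetical helper, not part of the product: the class DatePatternCheck and its method names are invented for illustration. It tests whether a pattern is syntactically valid and previews how a fixed sample moment (31 January 2024, 14:05) would look in that pattern.

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.Locale;

final class DatePatternCheck {
    // Returns true when SimpleDateFormat accepts the pattern syntax.
    static boolean isValidPattern(String pattern) {
        try {
            new SimpleDateFormat(pattern, Locale.ENGLISH);
            return true;
        } catch (IllegalArgumentException e) {
            // Thrown for illegal pattern letters, such as an unquoted literal 'T'.
            return false;
        }
    }

    // Formats a fixed sample moment so the shape of the pattern can be inspected.
    static String preview(String pattern) {
        Calendar sample = new GregorianCalendar(2024, Calendar.JANUARY, 31, 14, 5);
        return new SimpleDateFormat(pattern, Locale.ENGLISH).format(sample.getTime());
    }

    public static void main(String[] args) {
        String[] candidates = {"dd/MM/yyyy", "yyyy-MM-ddTHH:mm", "yyyy-MM-dd'T'HH:mm"};
        for (String candidate : candidates) {
            if (isValidPattern(candidate)) {
                System.out.println(candidate + " -> " + preview(candidate));
            } else {
                System.out.println(candidate + " -> invalid pattern");
            }
        }
    }
}
```

The second candidate is rejected because letters that are not pattern letters must be enclosed in single quotes, as the third candidate does.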

Note:

The valid date format yyyyMMdd, which is included in the date format reference data, is not recognized by this processor. This is because it contains no alpha characters or separators, and so cannot be distinguished from an eight-digit number.
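
The ambiguity can be seen with a small Java check. This is an illustrative sketch only; the class CompactDateAmbiguity and its methods are hypothetical and do not reflect how the processor is implemented.

```java
import java.text.ParsePosition;
import java.text.SimpleDateFormat;

final class CompactDateAmbiguity {
    // True when the whole value is a valid calendar date in yyyyMMdd form.
    static boolean parsesAsCompactDate(String value) {
        SimpleDateFormat parser = new SimpleDateFormat("yyyyMMdd");
        parser.setLenient(false);
        ParsePosition position = new ParsePosition(0);
        return parser.parse(value, position) != null
                && position.getIndex() == value.length();
    }

    // True when the value consists of digits only.
    static boolean isWhollyNumeric(String value) {
        return value.matches("\\d+");
    }

    public static void main(String[] args) {
        for (String value : new String[] {"20240131", "20241345"}) {
            System.out.println(value
                    + " date=" + parsesAsCompactDate(value)
                    + " number=" + isWhollyNumeric(value));
        }
    }
}
```

A value such as 20240131 passes both tests, so nothing in the value itself indicates whether a date or an eight-digit number was intended.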

Note:

The Data Types Profiler produces a percentage consistency statistic, which is calculated on the set of records input to the processor. In a real-time monitoring process, this set is limited by the configurable commit point on the reader (defined as a number of transactions or as a time limit). If a process with a Data Types Profiler is executed as a real-time response process, processing records one by one, each set contains a single record, so this consistency measure will always be 100%.
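
The effect of the commit point can be sketched as follows. The class CommitPointSketch is hypothetical, the type labels are plain strings chosen for the example, and null values are left out for simplicity. The same four values give 75% when profiled as one set, but 100% each when profiled one record at a time.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class CommitPointSketch {
    // Percentage of labels in one set that equal the most common label.
    static double consistency(List<String> typeLabels) {
        if (typeLabels.isEmpty()) {
            return 0.0; // assumption: an empty set is reported as 0
        }
        Map<String, Integer> counts = new HashMap<String, Integer>();
        int mostCommon = 0;
        for (String label : typeLabels) {
            Integer previous = counts.get(label);
            int current = (previous == null) ? 1 : previous + 1;
            counts.put(label, current);
            mostCommon = Math.max(mostCommon, current);
        }
        return 100.0 * mostCommon / typeLabels.size();
    }

    public static void main(String[] args) {
        List<String> stream = Arrays.asList("TEXT", "TEXT", "NUMBER", "TEXT");

        // One set of four records.
        System.out.println("set of 4: " + consistency(stream));

        // Four sets of one record each, as in a real-time response process.
        for (String label : stream) {
            System.out.println("set of 1: " + consistency(Arrays.asList(label)));
        }
    }
}
```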

The following table describes the statistics produced by the profiler. In addition to the number of records analyzed, the following statistics are available in the Results Browser for each attribute:

| Statistic | Description |
| --- | --- |
| Text | The number of values that were recognized as having a textual format. |
| Date | The number of values that were recognized as having a date format. |
| Number | The number of values that were recognized as having a number format. |
| % Consistency | A calculation of the consistency of the data types in each attribute, that is, the percentage of values that were recognized as matching the most common data type. |
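
The % Consistency calculation can be sketched as follows. The class ConsistencySketch is hypothetical. The choice to divide by the total number of records, including nulls, is an assumption inferred from the figures in the example table (for instance, 1862 text values out of 2001 records gives 93.1); it is not stated in the description above.

```java
final class ConsistencySketch {
    // Share of records that fall into the most common data type, as a percentage.
    static double consistencyPercent(long text, long number, long date, long nulls) {
        long total = text + number + date + nulls; // assumption: nulls count in the total
        if (total == 0) {
            return 0.0; // assumption: no records is reported as 0
        }
        long mostCommon = Math.max(text, Math.max(number, date));
        return 100.0 * mostCommon / total;
    }

    // Rounds to one decimal place for display.
    static double roundToOneDecimal(double value) {
        return Math.round(value * 10.0) / 10.0;
    }

    public static void main(String[] args) {
        // Counts taken from the TITLE and DT_PURCHASED rows of the example table.
        System.out.println("TITLE: " + roundToOneDecimal(consistencyPercent(1862, 0, 0, 139)));
        System.out.println("DT_PURCHASED: " + roundToOneDecimal(consistencyPercent(0, 0, 1998, 3)));
    }
}
```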

Examples

In this example, the Data Types Profiler is run on all attributes in a table of Customer records:

Table 1-121 Data Types Profiler Example

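| Input Field | Total number | Text Format | Numeric Format | Date/time Format | Null values | Consistency % |
| --- | --- | --- | --- | --- | --- | --- |
| CU_ACCOUNT | 2001 | 2000 | 0 | 0 | 1 | >99.9 |
| TITLE | 2001 | 1862 | 0 | 0 | 139 | 93.1 |
| NAME | 2001 | 2000 | 0 | 0 | 1 | >99.9 |
| GENDER | 2001 | 1853 | 0 | 0 | 148 | 92.6 |
| BUSINESS | 2001 | 1670 | 0 | 0 | 331 | 83.5 |
| ADDRESS1 | 2001 | 1999 | 0 | 0 | 2 | >99.9 |
| ADDRESS2 | 2001 | 1922 | 0 | 0 | 79 | 96.1 |
| ADDRESS3 | 2001 | 1032 | 0 | 0 | 969 | 51.6 |
| POSTCODE | 2001 | 1765 | 0 | 0 | 236 | 88.2 |
| EMAIL | 2001 | 1936 | 0 | 0 | 65 | 96.8 |
| ACC_MGR | 2001 | 1996 | 0 | 0 | 5 | 99.8 |
| DT_PURCHASED | 2001 | 0 | 0 | 1998 | 3 | 99.9 |
| DT_ACC_OPEN | 2001 | 0 | 0 | 1998 | 3 | 99.9 |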