1.3.7.11 Quickstats Profiler

The Quickstats Profiler provides fundamental quality metrics for a number of records or transactions, highlighting:

  • Candidate key columns

  • Completeness and missing data

  • Duplication

  • Uniqueness and diversity of values

Each input attribute is profiled individually.

Quickstats is useful to establish a picture of some of the fundamentals of data and its quality.

Often documentation and meta-data information are missing, incomplete, out of date, or not trusted. It is important to produce an unequivocal picture of the data, from the data itself, so that mistakes are not made inadvertently through false assumptions.

The following table describes the configuration options:

Configuration Description

Inputs

Specify any attributes from which you want to obtain quick profiling statistics.

Options

None.

Outputs

Describes any data attribute or flag attribute outputs.

Data Attributes

None.

Flags

The following flag is output:

  • [Attribute name].Populated: indicates which reports are populated. Possible values are Y/N.

The Quickstats Profiler requires a batch of records to produce its statistics (for example, in order to tell how many duplicate values there are for each attribute analyzed). It must therefore run to completion before its results are available, and is not suitable for a process that requires a real time response.

When executed against a batch of transactions from a real time data source, it will finish its processing when the commit point (transaction or time limit) configured on the Read Processor is reached.

The following table describes the statistics produced by the profiler for each attribute:

Statistic Description

With data

The number of records with data in that attribute.

Without data

The number of records without data in that attribute. This includes records that had a NULL value, and those that contained other types of No Data, such as only white space or non-printing characters. Drill down on the number to see a breakdown of the types of No Data found.

Singletons

The number of records with values that were found only once in that attribute.

Duplicates

The number of records with values that were found more than once in that attribute.

Distinct

he number of different values that were found in the attribute. Drill down on the number to see a breakdown of these values by their frequency of occurrence.

Comments

Automated comments based on the findings of the Quickstats profiler. See below.

Clicking on the Additional Information button will show the above statistics as percentages of the total number of records analyzed.

Automated Comments

Automated Comments are generated in order to highlight potential areas of interest in the data. For example:

  • Where an attribute is 100% complete and unique, it is identified as a possible key

  • Where an attribute is nearly 100% complete and unique it is highlighted as a possibly damaged key

  • Where an attribute is nearly 100% complete (suggesting blanks are not expected), the comment prompts the user to investigate nulls

  • Where an attribute is nearly 100% unique (suggesting duplicates are not expected), the comment prompts the user to investigate duplicates

  • Where an attribute has only one distinct value, the comment suggests that the attribute may be redundant

Note that where many of the above comments apply, the comments are concatenated.

Example

In this example, the Quickstats Profiler is used to gain an initial overview of a table of Customer records.

Table 1-123 Quickstats Profiler Example

Input Field Record Total With Data Without Data Singletons Duplications Distinct Values

CU_NO

2001

2000

1

1997

3

1998

CU_ACCOUNT

2001

2000

1

2000

0

2000

TITLE

2001

1862

139

3

1859

8

NAME

2001

2000

1

1980

20

1990

GENDER

2001

1853

148

0

1853

2

BUSINESS

2001

1670

331

1629

41

1649

ADDRESS1

2001

1999

2

1926

73

1954

ADDRESS2

2001

1921

80

554

1367

839

ADDRESS3

2001

1032

969

278

754

379

POSTCODE

2001

1762

239

1604

158

1672

AREA_CODE

2001

1884

117

64

1820

270

TEL_NO

2001

1994

7

1875

119

1934

EMAIL

2001

1936

65

1904

32

1920

ACC_MGR

2001

1996

5

0

1996

30

DT_PURCHASED

2001

1998

3

1090

908

1499

DT_ACC_OPEN

2001

1998

3

1093

905

1500

DT_LAST_PAYMENT

2001

1997

4

1026

971

1425

DT_LAST_PO_RAISED

2001

1998

3

1003

995

1433

BALANCE

2001

1999

2

7

1992

10

In most cases, drilling down on the numbers in the Summary View will take you directly to the records. However, some numbers take you to an interim view.

  • If you drill down on the 41 duplicate BUSINESS values in the Summary View, EDQ shows the frequency of each duplicate value.

  • If you drill down on the 8 distinct TITLE values in the Summary View, EDQ shows the frequency of each distinct value.

  • If you drill down on the 239 POSTCODE values without any data in the Summary View, EDQ shows a summary view of the different types of No Data found (though note that all of these will be Null values if the default No Data Handling Reference Data map is used in the Snapshot).