Quickstats Profiler

1.3.7.11 Quickstats Profiler

The Quickstats Profiler provides fundamental quality metrics for a number of records or transactions, highlighting:

Candidate key columns
Completeness and missing data
Duplication
Uniqueness and diversity of values

Each input attribute is profiled individually.

Quickstats is useful to establish a picture of some of the fundamentals of data and its quality.

Often documentation and meta-data information are missing, incomplete, out of date, or not trusted. It is important to produce an unequivocal picture of the data, from the data itself, so that mistakes are not made inadvertently through false assumptions.

The following table describes the configuration options:

Configuration	Description
Inputs	Specify any attributes from which you want to obtain quick profiling statistics.
Options	None.
Outputs	Describes any data attribute or flag attribute outputs.
Data Attributes	None.
Flags	The following flag is output: `[Attribute name].Populated`: indicates which reports are populated. Possible values are Y/N.

The Quickstats Profiler requires a batch of records to produce its statistics (for example, in order to tell how many duplicate values there are for each attribute analyzed). It must therefore run to completion before its results are available, and is not suitable for a process that requires a real time response.

When executed against a batch of transactions from a real time data source, it will finish its processing when the commit point (transaction or time limit) configured on the Read Processor is reached.

The following table describes the statistics produced by the profiler for each attribute:

Statistic	Description
With data	The number of records with data in that attribute.
Without data	The number of records without data in that attribute. This includes records that had a NULL value, and those that contained other types of No Data, such as only white space or non-printing characters. Drill down on the number to see a breakdown of the types of No Data found.
Singletons	The number of records with values that were found only once in that attribute.
Duplicates	The number of records with values that were found more than once in that attribute.
Distinct	he number of different values that were found in the attribute. Drill down on the number to see a breakdown of these values by their frequency of occurrence.
Comments	Automated comments based on the findings of the Quickstats profiler. See below.

Clicking on the Additional Information button will show the above statistics as percentages of the total number of records analyzed.

Automated Comments

Automated Comments are generated in order to highlight potential areas of interest in the data. For example:

Where an attribute is 100% complete and unique, it is identified as a possible key
Where an attribute is nearly 100% complete and unique it is highlighted as a possibly damaged key
Where an attribute is nearly 100% complete (suggesting blanks are not expected), the comment prompts the user to investigate nulls
Where an attribute is nearly 100% unique (suggesting duplicates are not expected), the comment prompts the user to investigate duplicates
Where an attribute has only one distinct value, the comment suggests that the attribute may be redundant

Note that where many of the above comments apply, the comments are concatenated.

Example

In this example, the Quickstats Profiler is used to gain an initial overview of a table of Customer records.

Table 1-123 Quickstats Profiler Example

Input Field	Record Total	With Data	Without Data	Singletons	Duplications	Distinct Values
CU_NO	2001	2000	1	1997	3	1998
CU_ACCOUNT	2001	2000	1	2000	0	2000
TITLE	2001	1862	139	3	1859	8
NAME	2001	2000	1	1980	20	1990
GENDER	2001	1853	148	0	1853	2
BUSINESS	2001	1670	331	1629	41	1649
ADDRESS1	2001	1999	2	1926	73	1954
ADDRESS2	2001	1921	80	554	1367	839
ADDRESS3	2001	1032	969	278	754	379
POSTCODE	2001	1762	239	1604	158	1672
AREA_CODE	2001	1884	117	64	1820	270
TEL_NO	2001	1994	7	1875	119	1934
EMAIL	2001	1936	65	1904	32	1920
ACC_MGR	2001	1996	5	0	1996	30
DT_PURCHASED	2001	1998	3	1090	908	1499
DT_ACC_OPEN	2001	1998	3	1093	905	1500
DT_LAST_PAYMENT	2001	1997	4	1026	971	1425
DT_LAST_PO_RAISED	2001	1998	3	1003	995	1433
BALANCE	2001	1999	2	7	1992	10

In most cases, drilling down on the numbers in the Summary View will take you directly to the records. However, some numbers take you to an interim view.

If you drill down on the 41 duplicate BUSINESS values in the Summary View, EDQ shows the frequency of each duplicate value.
If you drill down on the 8 distinct TITLE values in the Summary View, EDQ shows the frequency of each distinct value.
If you drill down on the 239 POSTCODE values without any data in the Summary View, EDQ shows a summary view of the different types of No Data found (though note that all of these will be Null values if the default No Data Handling Reference Data map is used in the Snapshot).