1.3.7.1 Character Profiler

Use the Character Profiler to discover all the distinct characters that exist in a number of text attributes, and how often they occur.

The Character Profiler is particularly useful for finding unexpected characters in text attributes that may need to be checked for on an ongoing basis (using Invalid Character Check), removed (using Denoise), or replaced (using Character Replace). Normalizing character discrepancies is also useful before Parsing.

The results are created so that they can easily be added to Reference Data for any of the above purposes.

Also, where a source of data contains records from a number of different countries, the Character Profiler can help you understand the ranges of characters in the data.

The following table describes the configuration options:

Configuration    Description
Inputs           Specify any String attributes that you want to search for character instances.
Options          None.
Outputs          Describes any data attribute or flag attribute outputs.
Data Attributes  None.
Flags            None.

The following table describes the statistics produced by the profiler:

Statistic                      Description
Character                      The character found in the data.
Decimal                        The decimal Unicode character reference. Note that a hash character is used to prefix the character references, so that the references can be used directly in Reference Data.
Hex                            The hexadecimal Unicode character reference. Note that #x is used to prefix the character references, so that the references can be used directly in Reference Data.
Total                          The total number of occurrences of the character across the selected input attributes.
Record Count                   The number of records containing the character in any of the selected input attributes.
[Attribute name] Total         The number of occurrences of the character in the attribute.
[Attribute name] Record Count  The number of records containing the character in the attribute.
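
To make the difference between the Total and Record Count statistics concrete, the following Python sketch computes the same kinds of figures over a small in-memory data set. It is only an illustration of the counting logic, not the product's implementation: the function name `profile_characters`, the representation of records as dictionaries, and the exact `#x` hex formatting are assumptions made for this example.

```python
from collections import Counter


def profile_characters(records, attributes):
    """Illustrative sketch: count characters across the given attributes.

    `records` is assumed to be a list of dicts (one per record) and
    `attributes` a list of keys to profile. Both are hypothetical
    stand-ins for the processor's inputs.
    """
    total = Counter()          # occurrences across all selected attributes
    record_count = Counter()   # records containing the character at all
    attr_total = {a: Counter() for a in attributes}
    attr_record_count = {a: Counter() for a in attributes}

    for record in records:
        seen_in_record = set()
        for attr in attributes:
            value = record.get(attr) or ""   # treat missing/null as empty
            total.update(value)              # counts every occurrence
            attr_total[attr].update(value)
            chars = set(value)               # each character once per record
            attr_record_count[attr].update(chars)
            seen_in_record |= chars
        record_count.update(seen_in_record)

    rows = []
    for ch in total:
        row = {
            "Character": ch,
            "Decimal": f"#{ord(ch)}",        # hash-prefixed decimal reference
            "Hex": f"#x{ord(ch):X}",         # assumed #x-prefixed hex reference
            "Total": total[ch],
            "Record Count": record_count[ch],
        }
        for attr in attributes:
            row[f"{attr} Total"] = attr_total[attr][ch]
            row[f"{attr} Record Count"] = attr_record_count[attr][ch]
        rows.append(row)

    # Sort by Total ascending so that low-frequency characters come first.
    rows.sort(key=lambda r: (r["Total"], r["Character"]))
    return rows


if __name__ == "__main__":
    sample = [
        {"name": "Peña", "city": "Logroño"},
        {"name": "Ana", "city": None},
    ]
    for row in profile_characters(sample, ["name", "city"]):
        print(row)
```

In this sketch a character that appears three times in one value adds three to Total but only one to Record Count, which is the distinction the two statistics are meant to capture.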

Example

In this example, the Character Profiler is used to find unusual characters in multi-language data from a Unicode database. The user chooses to look at the low-frequency characters first by sorting the results by the Total column (ascending).

Table 1-120 Character Profiler

Character  Decimal  Hex    Total (asc)
ñ          #241     #0xF1  1
ò          #242     #0xF2  1
ó          #243     #0xF3  1
ô          #244     #0xF4  1
õ          #245     #0xF5  1
ö          #246     #0xF6  1
ø          #248     #0xF8  1