1.3.9.8 Phrase Profiler

The Phrase Profiler analyzes a number of attributes and searches for common words and phrases.

The words and phrases are returned in order of their frequency across all the input attributes.

The Phrase Profiler is a quick way of discovering the most frequent and significant words and phrases in the data, and where they occur. You can then use the results of phrase profiling to drive the configuration of the Parse processor. For example, you can add the words and phrases that were found to Reference Data lists used to classify data, and, by seeing which words and phrases occur in which attributes, work out which token checks to apply to which attributes.

The Phrase Profiler is therefore an important tool to use when understanding the content of text fields, especially when you may need to improve or otherwise change the structure of the data (for example, for a data migration).

The following table describes the configuration options:

Configuration Description

Inputs

Specify any string attributes that you want to analyze for common words or phrases.

Options

Specify the following options:

  • Cutoff frequency (parts per million): Excludes words or phrases that occur only a small number of times in the data set. The threshold is expressed in parts per million, that is, as a small percentage of the records analyzed. For example, a value of 100 excludes values that occur fewer than 100 times in each million records (that is, in less than 0.01% of records). Type: number. Default value: 5000.

  • Allowable variation (parts per million): Allows you to cut off further insignificant phrases (that are contained within others), and mark top-level phrases as more significant, by expressing the allowable variation in frequency between two phrases that contain each other. Type: number. Default value: 5000.

  • Maximum words in a phrase: Sets a maximum length of phrases to return, in number of words. Type: number. Default value: 10. The maximum value for this option is 20, for performance reasons.

  • Additional word delimiter: Allows the definition of an additional separator character (as well as the normal space character) that will be used to separate words and phrases. Type: selection of common delimiter characters. Default value: None.

  • Word delimiter regular expression: Allows the definition of a regular expression to be used to separate words and phrases. Type: regular expression. Default value: None.

  • Ignore case?: Sets whether or not to distinguish between words or phrases that are the same except for case differences. Setting the Ignore case? option to Yes will mean that words and phrases will be represented in lower case in the results. Drilling down will reveal the data in its original case, as the data itself has not been transformed. Type: Yes/No. Default value: No.
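As a rough illustration of how the delimiter and case options could interact, the following Python sketch splits a single attribute value into words. The function name `split_words` and its parameters are hypothetical and are not part of the product. The sketch also assumes that a word delimiter regular expression, when supplied, is used instead of the space and additional delimiter characters; the actual precedence inside the processor may differ.

```python
import re


def split_words(value, extra_delimiter=None, delimiter_regex=None, ignore_case=False):
    """Split one attribute value into words (illustrative helper, not product code)."""
    if ignore_case:
        # Results are represented in lower case when case is ignored.
        value = value.lower()
    if delimiter_regex is not None:
        # Assumption: a supplied regular expression replaces the default splitting.
        pattern = delimiter_regex
    else:
        # Whitespace always separates words; the extra delimiter is optional.
        chars = r"\s" + (re.escape(extra_delimiter) if extra_delimiter else "")
        pattern = "[" + chars + "]+"
    # Discard empty strings produced by leading or trailing delimiters.
    return [word for word in re.split(pattern, value) if word]


print(split_words("12 High Street,Newcastle Upon Tyne", extra_delimiter=","))
print(split_words("Mr SMITH", ignore_case=True))
```

The original value is left untouched, which mirrors the note above that drilling down still shows the data in its original case.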

Outputs

Describes any data attribute or flag attribute outputs.

Data Attributes

None.

Flags

None.

Execution

Execution Mode          Supported

Batch                   Yes

Real time Monitoring    Yes

Real time Response      No

A large dataset containing free text will typically contain a large number of distinct phrases with only a few of them being significant in understanding the content of the dataset.

The Phrase Profiler provides two main settings to help eliminate insignificant results: the Cutoff frequency and the Allowable variation.

Cutoff frequency

Typically, the Phrase Profiler will generate a relatively small collection of phrases that occur in a large number of records and are potentially significant, together with a very large number of phrases that occur in a small number of records and so are less significant. You may want to exclude the less frequent phrases from the results. As the absolute cutoff frequency varies depending on the size of the dataset, it is convenient to express the Cutoff frequency setting as a frequency per million input records.
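The arithmetic behind a parts-per-million setting can be sketched as follows. The helper name `cutoff_count` is hypothetical; it only shows how the same setting translates into different absolute counts for datasets of different sizes.

```python
def cutoff_count(cutoff_ppm, record_count):
    """Convert a parts-per-million cutoff into an absolute number of occurrences."""
    return cutoff_ppm * record_count / 1_000_000


# The default setting of 5000 applied to datasets of two different sizes.
print(cutoff_count(5000, 1_000_000))  # 5000.0
print(cutoff_count(5000, 20_000))     # 100.0
```

With the default setting, a phrase therefore needs to occur in roughly 0.5% of the records analyzed to be returned, whatever the size of the dataset.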

Allowable variation

Where a phrase consists of many words (or a substring consists of many characters), longer phrases will include shorter phrases, so that data that includes the phrase 'Newcastle Upon Tyne' will also include at least the same number of occurrences of the sub-phrases 'Newcastle Upon' and 'Upon Tyne'.

If the two sub-phrases occur with exactly the same frequency as the full phrase and there is no variation in their frequencies, then the full phrase is significant (a 'top-level phrase') and the sub-phrases are not. The sub-phrases are therefore excluded from the results.

If the sub-phrases occur more frequently than the full phrase, however, then they become more interesting and the variation in frequency between a phrase and a sub-phrase is a measure of the independent significance of the sub-phrase. So you may specify an Allowable variation to remove sub-phrases with a variation in frequency that does not exceed this value. Again, as the absolute variation varies depending on the size of the dataset, it is convenient to express the Allowable variation setting as a variation per million input records.

Example

Consider the following parameters:

  • 1 million records are analyzed by the Phrase Profiler

  • The Cutoff frequency is set to 100 parts per million

  • The Allowable variation is set to 50 parts per million

  • There are 400 occurrences of the phrase 'Newcastle Upon Tyne'

  • There are 50 occurrences of the phrase 'Newcastel Upon Tyne'

The phrase 'Newcastle Upon Tyne' appears in the results but 'Newcastel Upon Tyne' does not because of the cutoff. The sub-phrase 'Upon Tyne' has a frequency of 450 and so is unaffected by the cutoff, but does not appear in the results because the frequency variation of 50 with its containing phrase is just within the allowable limit. If 'Upon Tyne' appeared in just one more record, anywhere within the data, then it would appear in the results as potentially significant. It is generally appropriate to set the Cutoff frequency and Allowable variation to the same value.
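The example above can be reproduced with a short Python sketch. This is not the profiler's implementation: the function names are hypothetical, and the sketch assumes that a sub-phrase is compared only against containing phrases that survive the cutoff, and that a variation equal to the limit counts as within it (as the example implies).

```python
def is_sub_phrase(sub, phrase):
    """True if the words of `sub` form a shorter contiguous run inside `phrase`."""
    s, p = sub.split(), phrase.split()
    return len(s) < len(p) and any(
        p[i:i + len(s)] == s for i in range(len(p) - len(s) + 1)
    )


def filter_phrases(freqs, record_count, cutoff_ppm, variation_ppm):
    """Apply the Cutoff frequency and Allowable variation to phrase frequencies."""
    cutoff = cutoff_ppm * record_count / 1_000_000
    allowed = variation_ppm * record_count / 1_000_000
    # Step 1: drop phrases that occur less often than the cutoff.
    kept = {p: f for p, f in freqs.items() if f >= cutoff}
    # Step 2: drop sub-phrases whose frequency is within the allowable
    # variation of a containing phrase (assumption: only surviving phrases count).
    result = {}
    for phrase, freq in kept.items():
        parents = [f for p, f in kept.items() if is_sub_phrase(phrase, p)]
        if any(freq - parent <= allowed for parent in parents):
            continue
        result[phrase] = freq
    return result


freqs = {"Newcastle Upon Tyne": 400, "Newcastel Upon Tyne": 50, "Upon Tyne": 450}
print(filter_phrases(freqs, 1_000_000, cutoff_ppm=100, variation_ppm=50))
```

Raising the frequency of 'Upon Tyne' to 451 in the input makes its variation 51, so the sketch then keeps it alongside 'Newcastle Upon Tyne'.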

Marking top-level phrases

Sometimes it is useful to know if a phrase is a sub-phrase of something else or if it is a 'top-level phrase'. In the above example, 'Newcastle Upon Tyne' may be a top-level phrase, in which case it presumably represents a city. However, if there were just one occurrence of the phrase 'Newcastle Upon Tyne Borough Council', and this occurrence were included in the results (not excluded by either the Cutoff frequency or Allowable variation options), then 'Newcastle Upon Tyne' would no longer be a top-level phrase and so may sometimes represent something other than a city. The Phrase Profiler flags top-level phrases in the results.
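One way to picture the top-level flag is as a check that no other phrase in the results contains the phrase. The sketch below is a simplified model with a hypothetical function name, and it assumes that words are separated by single spaces; it is not how the profiler itself computes the flag.

```python
def mark_top_level(results):
    """Map each phrase to True if no other phrase in `results` contains it."""
    flags = {}
    for phrase in results:
        # Pad with spaces so that only whole-word matches count.
        padded = f" {phrase} "
        flags[phrase] = not any(
            other != phrase and padded in f" {other} " for other in results
        )
    return flags


print(mark_top_level(["Newcastle Upon Tyne", "Upon Tyne"]))
print(mark_top_level(["Newcastle Upon Tyne", "Newcastle Upon Tyne Borough Council"]))
```

In the second call, 'Newcastle Upon Tyne' loses its top-level status because the longer council phrase is present in the results.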

The Phrase Profiler produces a summary view of its results, showing the words and phrases that were found in the input attributes in order of their frequency of occurrence. The following table describes the statistics produced by the profiler.

Statistic Description

Size

The size of the phrase, in number of words.

Top Phrase

Indicates whether or not the phrase is a top-level phrase. See the note above explaining the Allowable variation setting.

Phrase

The word or phrase that was found in the data.

Frequency

The number of occurrences of the phrase or word. Note that when drilling down to the data, you may see fewer records than this frequency, because the same phrase or word may occur more than once in some records.

[Attribute].freq

The number of occurrences of the phrase or word within each input attribute.
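The difference between the Frequency statistic, the per-attribute frequencies, and the number of records seen on drilling down can be illustrated with a small Python sketch. The function name `phrase_stats` and the sample records are made up for illustration, and the sketch splits words on whitespace only.

```python
def phrase_stats(records, phrase):
    """Return (total occurrences, occurrences per attribute, matching record count)."""
    target = phrase.split()
    per_attribute = {}
    matching_records = 0
    for record in records:
        found = False
        for attribute, value in record.items():
            words = value.split()
            # Count every position where the phrase's words occur in sequence.
            hits = sum(
                words[i:i + len(target)] == target
                for i in range(len(words) - len(target) + 1)
            )
            per_attribute[attribute] = per_attribute.get(attribute, 0) + hits
            found = found or hits > 0
        # A record is counted once, however many times the phrase occurs in it.
        matching_records += found
    return sum(per_attribute.values()), per_attribute, matching_records


records = [
    {"Name": "Mr John Smith", "Address": "1 Smith Street"},
    {"Name": "Mrs Ann Jones", "Address": "2 High Street"},
]
print(phrase_stats(records, "Smith"))
```

Here 'Smith' occurs twice in total, once in each attribute, but in only one record, which is why a drilldown can show fewer records than the Frequency value.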

Example

In this example, Customer Name and Address data is analyzed with a view to parsing it to resolve any structural issues. The Phrase Profiler is run in order to find the most common words and phrases in the name and address attributes. The options are configured as follows:

  • Cutoff frequency: 5000

  • Allowable variation: 5000

  • Maximum words in a phrase: 10

  • Additional word delimiter: comma (,)

  • Word delimiter regular expression: not used

  • Ignore case?: No

For example, if the words 'Mr', 'Ms', 'Mrs' and 'Miss' occur frequently and are valid Titles, we might create a Reference Data list for classifying them in parsing. We can then sort the results by the Title attribute to find further values that occur.