1.3.11.17 Extract Attributes

The Extract Attributes processor takes a single string containing any text as input and outputs distinct pieces of information from that string, using reference data (regular expressions, literal strings, or both) to drive the identification of attributes within the string. For example, it could be used to extract specific data items such as part number, quantity, and color from a product description text field.

It outputs the information as a correlated pair of arrays, one containing the attribute labels and the other containing their values.

It also outputs RemainingInput, a representation of the input text with all extracted values stripped out, leaving only the text where no matches were found and no extractions were made.

The way the reference data is defined affects the way that values are extracted. For example, if you want to extract Business Suffixes from a Company Name attribute, you may want to extract them only if the value ends with a value in the list.
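One way to express "only if the value ends with it" is an end-anchored regular expression. The sketch below uses Python's re module purely for illustration; the suffix list, the label, and the function name are hypothetical, and the regular expression syntax accepted by the processor may differ.

```python
import re

# Hypothetical reference data row: (regular expression, label).
# The trailing "$" anchors the match to the end of the value.
SUFFIX_PATTERN = re.compile(r"\b(?:Ltd|Inc|Corp|PLC)\.?$", re.IGNORECASE)
SUFFIX_LABEL = "Business Suffix"


def extract_trailing_suffix(company_name):
    """Return (label, value, remaining) if the name ends with a suffix, else None."""
    name = company_name.strip()
    match = SUFFIX_PATTERN.search(name)
    if match is None:
        return None
    remaining = name[: match.start()].strip()
    return SUFFIX_LABEL, match.group(0), remaining


print(extract_trailing_suffix("Acme Widgets Ltd"))
print(extract_trailing_suffix("Ltd Holdings Group"))
```

In this sketch the first call extracts the trailing "Ltd", while the second returns None because "Ltd" does not appear at the end of the value.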

Configuration Description

Inputs

The string to extract the attributes from.

Options

Specify the following options:

  • Regular Expressions to Match: The list of regular expressions to match, provided through reference data containing two columns: the first is the regular expression and the second is the corresponding label to output in the AttributeArray. The definition of the reference data can be edited by clicking the Browse for Reference Data button. The default selection is empty.

  • Literal Values to Match: The list of literal values to match, provided through reference data containing two columns: the first is the literal value and the second is the corresponding label to output in the AttributeArray. The definition of the reference data can be edited by clicking the Browse for Reference Data button. The default selection is empty.

  • Ignore Case: Whether or not to ignore case when matching the literal values in the specified list. Default is Yes.
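As a rough illustration of the two-column reference data and the Ignore Case option, the following Python sketch loads literal values with their labels and matches them with and without case sensitivity. The CSV layout, the sample rows, and the function names are assumptions made for the example; they are not the product's storage format or API.

```python
import csv
import io
import re

# Hypothetical two-column reference data: literal value, then output label.
LITERALS_CSV = """\
TEAO,Brand
Pencils,Stationary Type
"""


def load_reference_data(csv_text):
    """Parse two-column rows into (match value, label) pairs."""
    return [(row[0], row[1]) for row in csv.reader(io.StringIO(csv_text)) if row]


def compile_literals(rows, ignore_case=True):
    """Escape each literal so it is matched verbatim, not as a regular expression."""
    flags = re.IGNORECASE if ignore_case else 0
    return [(re.compile(re.escape(value), flags), label) for value, label in rows]


rows = load_reference_data(LITERALS_CSV)
relaxed = compile_literals(rows, ignore_case=True)
strict = compile_literals(rows, ignore_case=False)

text = "pencils #2HB"
relaxed_labels = [label for pattern, label in relaxed if pattern.search(text)]
strict_labels = [label for pattern, label in strict if pattern.search(text)]
print(relaxed_labels)
print(strict_labels)
```

With case ignored, the lowercase "pencils" still matches the literal "Pencils"; with case respected, nothing matches.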

Outputs

Number of records with extraction performed and extraction not performed.

Data Attributes

The following data attributes are output:

  • AttributeArray: Array of attribute labels extracted from the input string.

  • ValueArray: Array of attribute values, each at the same index as its label in AttributeArray.

  • RemainingInput: The text remaining in the input string after all attributes have been extracted, that is, the text that did not match either a literal or a regular expression.

Flags

The following flags are output:

  • AttributesExtractedFlag: Y, if any attributes were extracted. N, if not.
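Because the two arrays are correlated by index, a downstream consumer can pair them up directly. The Python sketch below uses the values from the first row of the Example section; the variable names are illustrative, and deriving the flag from the array length is an assumption about how a consumer might reproduce it, not a description of the processor's internals.

```python
# Values copied from the first row of the Example section.
attribute_array = ["Definition", "Brand"]
value_array = ["HP = 1/4", "TEAO"]

# The flag is "Y" when at least one attribute was extracted, "N" otherwise.
attributes_extracted_flag = "Y" if attribute_array else "N"

# Labels and values share an index, so zip() pairs them up.
# A list of pairs is used rather than a dict in case a label occurs twice.
pairs = list(zip(attribute_array, value_array))
for label, value in pairs:
    print(f"{label}: {value}")
print(attributes_extracted_flag)
```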

Example

In this example, a string is input, and the resulting attributes and their values are output.

Input String: TEAO HP = 1/4 1725RPM 115V 48YZ YOKE MTR

Result Attribute/Result Value:

  • attributearray = {"Definition", "Brand"}

  • valuearray = {"HP = 1/4", "TEAO"}

  • remaininginput = 1725RPM 115V 48YZ YOKE MTR

Input String: Pencils #2HB Nontoxic Lead 12 / Box Wood

Result Attribute/Result Value:

  • attributearray = {"Graphite Grade", "Grouping", "Stationary Type"}

  • valuearray = {"#2HB", "12 / BOX", "Pencils"}

  • remaininginput = Nontoxic Lead Wood
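The behavior shown in the example can be approximated with a short Python sketch. This is not the processor's implementation: the function name, the reference data rows, the rule that regular expressions are applied before literal values (suggested only by the order of the example output), the single extraction per reference data row, and the whitespace cleanup are all assumptions.

```python
import re


def extract_attributes(text, regex_rows, literal_rows, ignore_case=True):
    """Return (attribute_array, value_array, remaining_input, flag)."""
    attribute_array, value_array = [], []
    remaining = text

    # Assumption: regular expressions are applied before literal values,
    # which is the order the example output suggests.
    patterns = [(re.compile(expr), label) for expr, label in regex_rows]
    literal_flags = re.IGNORECASE if ignore_case else 0
    patterns += [
        (re.compile(re.escape(value), literal_flags), label)
        for value, label in literal_rows
    ]

    # Simplification: each reference data row extracts its first match only.
    for pattern, label in patterns:
        match = pattern.search(remaining)
        if match:
            attribute_array.append(label)
            value_array.append(match.group(0))
            # Strip the extracted value out of the remaining text.
            remaining = remaining[: match.start()] + " " + remaining[match.end():]

    # Collapse the whitespace left behind by the removals.
    remaining = " ".join(remaining.split())
    flag = "Y" if attribute_array else "N"
    return attribute_array, value_array, remaining, flag


# Hypothetical reference data chosen to reproduce the first example row.
regex_rows = [(r"HP = \d+/\d+", "Definition")]
literal_rows = [("TEAO", "Brand")]

result = extract_attributes(
    "TEAO HP = 1/4 1725RPM 115V 48YZ YOKE MTR", regex_rows, literal_rows
)
print(result)
```

With this reference data, the sketch returns the same label and value arrays and the same remaining input as the first example row, plus a flag of "Y".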