4 Standardizing Item Definitions

Standardizing individual Item Definitions is different than assigning global standardizations as described in previous chapter. With Item Definition standardization, you select how you want the data to be standardized within the context of a specific Item Definition and the standardization only applies to that one Item Definition.

For example, you might want to standardize pen color differently from pencil color. By creating a rule for the color attribute of pens, you can ensure that the output data contains colors as they relate to pens only. A separate color rule for pencils could be defined so that the output relates only to pencil colors. These rules would not globally standardize colors for all Item Definitions, only for the pen and pencil Item Definitions; these rules would override any global standardization.

Standardize Items Tab

The Standardize Item tab and the associated sub-tabs provide you with all of the functionality needed to standardize your data at Item Definition level.

Standardize Attributes Sub-Tab

The Standardize Attributes sub-tab is the default when selecting the Standardize Items tab for the first time.

Surrounding text describes attrisubrule.png.

The Standardize Attributes sub-tab has several distinct functional representations to provide you with the ability to numerically rewrite rules, automatically rewrite rules globally, or reorder phrase productions.

Item Definitions Pane

All of the Item Definitions in your data lens are displayed in the Item Definitions pane. You can select an Item Definition, attribute, phrase, or term.

The full hierarchy of the parent Item Definition is displayed though all children Item Definitions do not automatically display the underlying attributes. You must double-click on an Item Definition to expand the full attribute structure for viewing and standardizing. Alternatively, you can use the Expand Node option. However, if the parent Item Definition contains attributes (identified by a red, square icon rather than blue), this Item Definition and its children Item Definitions are automatically expanded and fully visible.

Note:

The attributes of a parent Item Definitions that only contains attributes (denoted with a red, square icon) can be standardized in the same manner as all other attributes.

Once you have expanded an Item Definition, the attributes are displayed and will remain so because they have been added to the Item Definition hierarchy and are now in memory. Displaying an entire large Item Definition hierarchy requires a great deal of memory so this behavior ensures that the Item Definition pane is populated quickly and with a minimum of memory.

The behavior of the Standardize Attributes sub-tab is dependent on the selection made in this pane. For example, the selection of a term results in the appearance of the Nodes for Insertion pane and the Rewrite Rules pane changes to an Ordering Rule pane.

The context-sensitive menu for this pane has the following possible menu options:

Expand Node

Use this option to expand the hierarchical structure of the current selection.

Copy to…

See "Sharing Item Definitions Standardizations".

Copy with Children to…

See "Sharing Item Definitions Standardizations".

Copy to another Item Definition…

See "Sharing Item Definitions Standardizations".

Jump to Standardization

Use this option to go to the Standardize Phrases sub-tab of the Standardize tab to set a global standardization rule.

Rewrite Rules Pane

The Rewrite Rules pane makes it easy for you to create standardization rules by providing four different methods that are based on the type of selection made in the Item Definitions pane.

Numeric Rewrite Rules

The Numeric Rewrite Rules representation of the Rewrite Rules pane allows you to standardize number-based Item Definition attributes based on numerical values, such as standard units of measurement.

The default Numeric Rewrite Rules pane is a text replacement.

Surrounding text describes numrewritetxt.png.

The Unit of Measure for Table list allows you to select the numerical value you want to standardize in the table below it. This list is populated with all of the units of measure in your data lens.

Buttons

The buttons are used as follows:

Add Row: Adds rows to the table one row at a time.
Remove Row: Deletes the row that is active or the bottom row. Some rows cannot be deleted.

Caution:
There is no delete verification prompt and rows cannot be restored.
Sort Table: The table is sorted.

Options

The Text Replace and Value for Target options allow you to choose a method of replacement for the selected rules, by text or by value.

Text Replace Table

The columns of the text replace table operate as follows:

>=: Indicates the lower bound and cannot be edited.
Number: The number to replace with standardized text, which is automatically populated based on the entries in the Upper Bound column.
Upper Bound: A number at which you want this replacement rule to stop (upper bound) for the given Number. Depending on the preceding row, the fields in this column are either automatically populated or you can enter new data.
Replacement Text: Allows you to enter the text with which the specified Number will be replaced.

Value Replace Table

The value replacement table operates similar to the text replace table.

Surrounding text describes numrewriteval.png.

The first three columns operate as previously described. The Target and UOM Spelling columns replace the Replacement Text column so that you can provide values rather than text.

The Target column is a list of unit of measurements from which you can choose.

The UOM Spelling column allows you to enter what spelling the output data will be for the selected Target.

If you want to delete a value replacement table associated with a particular phrase, right-click on the phrase and click Reset Table.

Simple Rewrite Rules

This simplified Rewrite Rules pane allows you to define new text for the selected rule by entering the replacement text into the Rewrite As field.

Surrounding text describes simrewrite.png.

Automatic Rewrite Rules

Selecting a term from the Item Definitions pane, changes the Numeric Rewrite Rule pane to an automatic Rewrite Rules pane to allow you to set a global standardization rule for the selected term. This functionality is identical to the Standardize Phrases sub-tab of the Standardize tab and is described in the previous chapter.

Reorder Phrase Productions

Selecting a phrase production from the Item Definitions pane, changes the Numeric Rewrite Rules pane to an Ordering Rule pane to allow you to reorder the phrase or add productions. This functionality is identical to the Standardize Phrases sub-tab of the Standardize tab and is described in the previous chapter.

Order Attributes Sub-Tab

The Order Attributes sub-tab allows you to define the order in which Item Definition attributes are ordered in the output data.

Surrounding text describes ordattsub.png.

Item Definition Pane

This pane allows you to choose only a specific Item Definition or all Item Definitions in your data lens.

Note:

The active selections in this section are based on which Item Definition you have selected.

Selection Pane

The Selection pane allows you to select which Item Definition attributes you want to match on (show) by moving attributes from the Ordered Item Definition Attributes to Show list to the Item Definition Attributes not Shown list. All of the attributes in the Ordered Item Definition Attributes to Show list will be used for matching and shown in your standardized output.

Attributes are moved between lists using the right and left arrows or by double-clicking on an attribute. The up and down arrows are used to change the match and output representation order.

The Reset All Rules button is only active when the Item Definitions folder has been selected. It can be used to reset all Null and Multiple Instance Handling settings for all of the Item Definitions in your data lens.

Null Handling

The Null Handling section allows you to choose how you want null attribute values processed and is applicable to all the attributes listed in the Selection pane. The processing choices for null values are as follows:

Allow Null: Empty attributes are allowed.
Replace Null: Empty attributes are replaced with the text entered in the field.
Mandatory - disallow Null: Empty values are not allowed in the data lens.
Ignore Null values: Empty values are ignored.

Multiple Instance Handling

The Multiple Instance Handling section allows you to choose how you want multiple attribute values to be processed. The processing choices are as follows:

Allow Multiple Instances: Multiple instances of an attribute are allowed.
Replace Multiple With: Multiple instances of an attribute are replaced with the text entered in the field.
Disallow Multiple Instances: This feature is deprecated.
Ignore Extra Instances: The first occurrence of an attribute is selected for output; all subsequent attribute values are ignored.
Concatenate With: The unique multiple instances of an attribute are concatenated together with the specified string. Additionally, this operation sorts the attribute values (either numerically or alphabetically, depending on the nature of the attribute value.)

Match Weights Sub-Tab

The Match Weights sub-tab allows you to set the number of attributes to be matched and the order the selected attributes are matched for a specific Item Definition. This sub-tab is not active until a Match Type is created as described in "Match Type".

Surrounding text describes matweigsub.png.

Item Definition Pane

This pane allows you to choose one Item Definition for which you want to change the match ranking.

Selection Pane

The Selection pane allows you to select, which Item Definition attributes you want to match on from those displayed in the list. This selection occurs automatically by increasing or decreasing the number in the Number of required attributes list. The selection begins with the first item in the list and sequentially selects downward.

The up and down arrows are used to change the match ranking of the selected attributes, moving those that are most important to use for matching to the top of the list. These arrows are activated when you select one attribute from the list.

After you have identified your matching attributes in each Item Definition, your Match Type is complete. The use of these attributes to create matched items occurs within the Matching DSA.

Note:

Matching data is a complex process that is configured using the Application Studio, Knowledge Studio, as well as the Governance Studio.

Duplicate Matching

Duplicate and nearly duplicate data is identified and matched for an Item Definition. There are many reasons for wanting to identify duplicate data, both within and between data sets. The problem with duplicate identification is that records may not have identical forms so they cannot be found through standard string comparison methods. For example, 'Ballpoint Pen Refill Med Pt Black Ink 2 / Pk' and 'Ballpoint Pen Refill Medi, Point black, ink 2 / Pack' might refer to to the same item, even though 'Pack' is spelled differently.

The actual matching of data occurs within a DSA though the foundation for creating matches is in the data lens. The matching function is based on a comparison of attribute values. If the match-attribute values for multiple lines of data are equivalent, then the items are identified as matches.

Depending on the use case you develop, you may want to match on part number, manufacturer, and brand. In that case, you would select these three attributes to participate in the matching process.

Another use-case example is one that requires matching items based on form, fit, and function. The match could be defined based on length, width, height, color, and material.

It is important to realize the following:

matching requires both a standardization type and a match type
matching is done on attributes occurring within a given Item Definition
matching requires certain settings for null and multiple attribute handling

Test Attributes Sub-Tab

The Test Attributes sub-tab allows you to review your attribute standardization rules by Item Definition that are applied to your sample data to validate your results.

Surrounding text describes tstattsub.png.

Item Definition Section

All of the Item Definitions in your data lens are listed for your selection.

The Show Value and Show Text options can be used to change the way the data is viewed. Select Show Text to ensure that units of measure are displayed.

You can test the attributes for any of the Item Definitions in your data lens by selecting a different Item Definition from the list. All of the data on this tab is changed to display the date related to the new Item Definition selection.

By default, only the data for active Item Definitions is displayed. For information about viewing inactive Item Definition data, see "Active vs. Inactive Item Definitions".

Sample Data Table

This table displays the original data and the same data after it has been standardized. The columns, left to right, indicate the following:

Line Number (#): The unique number assigned to that line of data.
Quality Index (QI): A number between 0 and 100 that represents the degree to which the Item Definition for the line has been standardized.
Red Check Mark: Data you have reviewed and marked as such by selecting the check box in that line of data.
Initial Text: The original data that was parsed by the data lens.
Remaining Columns: The remaining columns are dependent on the attributes for each Item Definition so these columns vary.

Each of the columns that contain data can be used to sort the table, both ascending and descending, by clicking on the column title. Clicking a column heading once sorts the table, by the items in the selected column, in ascending alphabetically order. Clicking the same column heading a second time sorts the table again in descending alphabetical order.

Selecting one of the lines in the Sample Data Table displays the Item Definition information for the selection in the Standardized Attribute Table.

Source Field

This field contains the original data. This field can be edited and when you press Enter, you can review the immediate effects on the data lens.

Standardized Text Section

The standardized version of the original data is displayed in the field; it cannot be edited.

The Attribute Separator field allows you to specify a delimiter for the attribute values.

The Append Unattributed Text check box allows you to append the data that has been recognized though is not part of the Item Definition.

Similarly, the Append Unparsed Text check box appends unparsed text to your description.

Standardized Attributes Table

All of the standardized attributes for the selected Item Definition and the line selected in the Sample Data pane are displayed in this table. The attribute standardization selections from the Standardized Text section are reflected in the way this field is displayed.

Test Item Standardization Sub-Tab

The Test Item Standardization sub-tab allows you to review the Item Definition standardization rules that you have created applied to your sample data to validate your results.

Surrounding text describes tesitemsub.png.

By default, only the data of active Item Definitions is displayed. For information about viewing inactive Item Definition data, see "Active vs. Inactive Item Definitions".

Sample Data Table

Several of the columns of this table are the same as those on the Test Attributes sub-tab (see "Test Attributes Sub-Tab") and this table operates the same way. The differing columns are as follows:

Length (Len): Indicates the character length of the original text.
Standardized Text: Indicates the standardized original text.

Source Field

This field contains the original data. This field can be edited and when you press Enter, you can review the immediate effects on the data lens.

Standardized Text Section

The standardized version of the original data is displayed in the field; it cannot be edited.

The Attribute Separator field allows you to enter a textual separator for use between attributes.

The Append Unattributed Text check box; selecting this box appends all text that has not been attributed to your description.

Similarly, the Append Unparsed Text check box appends unparsed text to your description.

Item Attributes Section

All of the attributes for the selected Item Definition are displayed in this field; it cannot be edited. The attribute standardization selections from the Standardized Text Section are reflected in the way this field is displayed.

Regression Test Sub-Tab

Regression testing is an important part of data standardization so that you can be sure that your data output is as you expect.

Standardize Items regression testing only works for lines that have an Item Definition. If there is no Item Definition that recognizes the line, no regressions are displayed.

If the tab is not active, set the Regression Testing Active data lens option. For more information, see "Ensuring Regression Testing is Active".

Surrounding text describes regtstsubstanitem.jpg.

Sample Data Table

This table displays the regression and current view of the sample data and attributes.

If there is no data displayed in the Sample Data Table, the sample data has not been initialized; a regression base does not exist. For information about initializing the regression base, see "Creating and Updating the Regression Base".

The columns, left to right, indicate the following:

#

Line Number assigned to the line of sample data.

AC

Attribute Count displays the attribute count for the Regression and Current Attributes.

AD

Attribute Differences displays number of attribute differences between the Current and Regression files.

Regression

Item Definition assigned to the line of data based on the Regression file.

Red Check Mark

The red check mark or Review column indicates new or changed lines of data and the text on these lines should be reviewed. If the information in the Current Text column is correct and you want to accept the changes as valid progressions, select this check box so that the data is included in the regression testing.

The regression base is updated with the reviewed and accepted lines of text using the Update Regression Base option on the File menu.

Current

Item Definition assigned to the line of data based on the Current file.

ID

Unique identifier assigned to the sample data that was included as part of the sample data.

Standardized Data

This is based on any Global Line Order standardization that may have been setup.

Current and Regression Attributes Sections

These sections display the current and regression attributes for the selected line of data in the Sample Data Table.

Creating and Updating the Regression Base

The best practice in creating a regression base is to combine your sample data into a one file. For more information, see "Combining Sample Files".

Combining files does not remove any data; it simply combines the selected sample files into a new, larger file.

Next, make single changes to your regression base sample data file, check your regression sets, and update them as appropriate. Making multiple changes can make the regressions hard to read, which increases the chance that an error is overlooked or is much harder to fix.

To create the regression base, select the Create New Regression Base option on the File menu, and then select the sample data file that you want to use for regression testing. This initializes the regression base and displays the results in the After pane.

You can update the regression base with the reviewed and accepted lines of text (as previously described in red check mark column) using the Update Regression Base option on the File menu.

Note:

You should only initialize or update the regression base if you have reviewed or accepted the sample data.

Match Type

To use the Enterprise DQ for Product matching functionality you must create a match schema, or match type, to be used throughout your data lens. You can create one or more match types that can be used to change how your data is matched depending on your use case.

Note:

You must configure how you want multiple instances and null values handled before you create a match type as described in"Order Attributes Sub-Tab".

This feature is accessed from the Data Lens menu, by clicking Match Types…, and then clicking the Add New button.

Surrounding text describes newmattype.jpg.

Enter the requested information to create your new match type that will be added as a selection option to the Match Types list.

The creation and selection of a Match Type is necessary to activate the Match Weights sub-tab.