Most standardization files for person data are specific to a national domain. Each domain node within the Standardization node of the project includes the files defined in this section. The domain corresponding to each file is indicated at the end of the file name; for example, personConstantsUK.cfg and personConstantsFR.cfg. In the descriptions below, an asterisk (*) in a file name stands in for the domain abbreviation.
You can customize these files to add entries of other nationalities or languages, including those containing diacritical marks.
The conjunction reference file is not currently used, but is designed to work with the person name patterns file during standardization.
The person constants file defines certain information about the standardization files used for processing person data, primarily the number of lines contained in each file. The number of lines specified here must be equal to or greater than the number of lines actually contained in each file. The constants file for United States data is in the Standardization node of the project and is named personConstants.cfg; the person constants file for the other domains is located under the domain name node.
Table 9 lists and describes each parameter in the constants file. The files referenced by these parameters are described on the following pages.
Table 9 Person Constants File Parameters
| Parameter | Description |
|---|---|
| | The maximum number of words in a given free-form text field containing a person name. This parameter is not currently used. |
| | The maximum number of lines in the person conjunction reference file (personConjon*.dat). |
| | The maximum number of lines in the generational suffix category file (personGenSuffix*.dat). |
| | The maximum number of lines in the first name category file (personFirstName*.dat). |
| | The maximum number of lines in the last name category file (personLastName*.dat). |
| | The maximum number of lines in the last name prefix category file (personLastNamePrefix*.dat). |
| | The maximum number of lines in the title category file (personTitle*.dat). |
| | The maximum number of lines in the occupational suffix category file (personOccupSuffix*.dat). |
| | The maximum number of lines in the business name reference file (businessOrRelated*.dat). |
| | The maximum number of lines in the person patterns file (personNamePatt.dat). |
| | The maximum number of lines in the two-character reference file for occupational suffixes (personTwo*.dat). |
| | The maximum number of lines in the three-character reference file for occupational suffixes (personThree*.dat). |
| | The maximum number of lines in the special characters reference file (personRemoveSpecChars.dat). |
| | The maximum number of lines in the hyphenated name category file (personFirstNameDash.dat). |
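The requirement that each declared maximum be equal to or greater than the actual number of lines in the corresponding file can be checked before deployment. The following is a minimal sketch, assuming you have already collected the declared maximums from the constants file yourself. The `declared_max` mapping, the `check_line_counts` helper, and the UTF-8 encoding are assumptions made for illustration and are not part of the Sun Match Engine.

```python
from pathlib import Path


def count_lines(path):
    """Return the number of lines in a text file."""
    # Encoding is an assumption; adjust it to match your data files.
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def check_line_counts(declared_max, directory):
    """Return {file name: (declared, actual)} for files that exceed their maximum.

    declared_max is a hypothetical mapping of data-file name to the
    maximum line count stated in the constants file.
    """
    problems = {}
    for name, declared in declared_max.items():
        actual = count_lines(Path(directory) / name)
        if actual > declared:
            problems[name] = (declared, actual)
    return problems
```

An empty result means every file fits within its declared maximum; any entry returned identifies a parameter that should be raised in the constants file.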
The first name category file defines standardized versions of first names and assigns a gender classification for each name. This file is used to standardize first names when comparing person names. The gender classification helps to further clarify the match. The Sun Match Engine uses this file when a first name field is defined for normalization or standardization in the Match Field file.
The syntax of this file is:
original-value standardized-form gender-class
You can modify or add entries in this table as needed. Table 10 describes the columns in the personFirstName*.dat file.
Table 10 First Name Category File
Following is an excerpt from the personFirstNameUS.dat file. Certain rows contain a zero (0) for the standardized form, indicating that the name is already standard (for example, Stephen, Sterling, and Summer).
    STEPHEN 0 M
    STEPHENIE STEPHANIE F
    STEPHIE STEPHANIE F
    STEPHINE STEPHANIE F
    STEPHNIE STEPHANIE F
    STERLING 0 M
    STEVE STEPHEN M
    STEVEN STEPHEN M
    STEVIE STEPHEN N
    STEW STUART M
    STEWART STUART M
    STU STUART M
    STUART 0 M
    SU SUSAN F
    SUE SUSAN F
    SUHANTO 0 M
    SULLIVAN 0 F
    SULLY SULLIVAN F
    SUMMER 0 F
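To illustrate how the three columns relate, the sketch below parses a few entries and looks up a name. It assumes the columns are separated by whitespace and that lookups are case-insensitive; the function names are hypothetical, and this is not a description of how the Sun Match Engine itself performs the lookup.

```python
def parse_first_names(text):
    """Map each original value to (standardized form, gender class)."""
    table = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        original, standard, gender = fields
        # A zero means the original value is already the standard form.
        table[original] = (original if standard == "0" else standard, gender)
    return table


def standardize_first_name(name, table):
    """Return (standardized name, gender class); unlisted names pass through."""
    key = name.strip().upper()
    return table.get(key, (key, None))


SAMPLE = """\
STEPHEN 0 M
STEVE STEPHEN M
STEVIE STEPHEN N
SUE SUSAN F
SUMMER 0 F
"""

table = parse_first_names(SAMPLE)
print(standardize_first_name("Steve", table))   # ('STEPHEN', 'M')
print(standardize_first_name("summer", table))  # ('SUMMER', 'F')
```

Returning the input unchanged with a gender class of `None` for unlisted names is a design choice of this sketch, so that callers can tell a pass-through from a real match.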
The generational suffix category file defines standardized versions of generational suffixes, such as Jr., III, and so on. This file is used to compare standard versions of the suffix field. You can define additional suffixes and their standardized form following the syntax below.
field-value standard-form
Table 11 describes each column of the personGenSuffix*.dat file.
Table 11 Generational Suffix Category File
| Column | Description |
|---|---|
| field-value | The original value of the generational suffix in the record being processed. |
| standard-form | The standard form of the generational suffix. A zero (0) in this column indicates that the value listed in column one is already in its standardized form. If this column contains a suffix instead of a zero, that suffix must also be listed in a different entry as an original value with a standard form of “0”. |
An excerpt from the personGenSuffixUS.dat file appears below. In this excerpt, certain suffixes, such as 2ND, 3RD and JR, are already in their standardized form.
    11 2ND
    111 3RD
    1V 4TH
    2ND 0
    3RD 0
    4TH 0
    FOURTH 4TH
    II 2ND
    III 3RD
    IV 4TH
    JR 0
    JUNIOR JR
    SECOND 2ND
    SENIOR SR
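The rule that every non-zero standard form must also appear as an original value with a standard form of 0 is easy to break when adding entries by hand. The following is a minimal sketch of a consistency check, assuming two whitespace-separated, single-token columns per line; `find_missing_standard_forms` is a hypothetical helper, not a tool shipped with the product.

```python
def find_missing_standard_forms(text):
    """Return standard forms that are not themselves listed with a 0."""
    entries = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            entries[fields[0]] = fields[1]
    missing = set()
    for original, standard in entries.items():
        # A non-zero standard form needs its own "<form> 0" entry.
        if standard != "0" and entries.get(standard) != "0":
            missing.add(standard)
    return sorted(missing)


SAMPLE = """\
II 2ND
2ND 0
JUNIOR JR
JR 0
SENIOR SR
"""

print(find_missing_standard_forms(SAMPLE))  # ['SR']
```

The sample reports `SR` only because it, like the excerpt above, does not include an `SR 0` line. The same check applies to any two-column category file whose entries are single tokens; files with multi-word entries would need a different parsing rule.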
The last name prefix category file defines standardized versions of last name prefixes, such as “Van” or “Le”. This file is used to standardize these prefixes prior to standardizing the last name when comparing person names. The Sun Match Engine uses this file when a last name field is defined for normalization or standardization in the Match Field file.
The syntax of this file is:
original-value standardized-form
You can modify or add entries in this table as needed. Table 12 describes the columns in the personLastNamePrefix*.dat file.
Table 12 Last Name Prefix Category File
| Column | Description |
|---|---|
| original-value | The original value of the last name prefix. |
| standardized-form | The standardized version of the original value. A zero (0) in this field indicates that the original value is already in its standardized form. If this column contains a prefix instead of a zero, that prefix must also be listed in a different entry as an original value with a standardized form of “0”. |
Following is an excerpt from the personLastNamePrefixUS.dat file. Some of these prefixes are already in their standardized form, such as “Los” and “Mac”.
    LOS 0
    MAC 0
    MC MAC
    SAINT 0
    ST SAINT
    VAN 0
    VAN DER 0
    VANDE VAN DER
The last name category file defines standardized versions of last names. This file is used to standardize last names when comparing person names. The Sun Match Engine uses this file when a last name field is defined for normalization or standardization in the Match Field file.
The syntax of this file is:
original-value standardized-form
You can modify or add entries in this table as needed. Table 13 describes the columns in the personLastName*.dat file.
Table 13 Last Name Category File
| Column | Description |
|---|---|
| original-value | The original value of the last name. |
| standardized-form | The standardized version of the original value. A zero (0) in this field indicates that the original value is already in its standardized form. If this column contains a name instead of a zero, that name must also be listed in a different entry as an original value with a standardized form of “0”. |
Following is an excerpt from the personLastNameUS.dat file.
    FINK 0
    PHINQUE FINK
The occupational suffix category file is not currently used, but is designed to work with the person name patterns file during standardization.
This reference file is not currently used, but is designed to work with the person name patterns file during standardization.
The title category file defines standard forms for titles and classifies each title into a gender category. For example, “Mister” is standardized to “MR” and is classified as male; “Doctor” is standardized to “DR” and is classified as gender neutral. You can add, modify, or delete entries in this file as needed. Use the following syntax.
original-value standardized-form gender-class
Table 14 describes each column of the personTitle*.dat file.
Table 14 Person Title Category File
An excerpt from the personTitleUS.dat file appears below. In this excerpt, certain titles, such as DR, GEN, and MISS, are already in their standardized form.
    CTO 0 N
    DEAN 0 N
    DIR DIRECTOR N
    DIRECTOR 0 N
    DOC DR N
    DOCTOR DR N
    DR 0 N
    DRS 0 N
    EMERITUS 0 N
    FOUNDER 0 N
    GEN 0 N
    GENERAL GEN N
    MANAGER 0 N
    MGR MANAGER N
    MISS 0 F
    MISSUS MRS F
This reference file is not currently used, but is designed to work with the person name patterns file during standardization.
The business-related category file is used to identify business terms in person name information. Such terms can appear, for example, when you index both person and business names or when business information is included within a person object structure. The Sun Match Engine removes these terms for person matching. This file contains a list of common business terms that might be found in person data. You can modify this file by adding, changing, or deleting terms.
An excerpt from the businessOrRelatedUS.dat file appears below.
    ACCOUNTANT
    ACCT
    ACDY
    ACRE
    ACREAGE
    ACRES
    ACS
    ACT
    AD
    ADATU
    ADM
    ADMIN
    ADMINISTRATIO
    ADMINISTRATION
    ADMINISTRATOR
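The effect of removing business terms from a person name can be sketched as follows. This is only an illustration under stated assumptions: one term per line, whole-token and case-insensitive matching, and hypothetical function names. The actual removal logic of the Sun Match Engine is not described here and may differ.

```python
def load_terms(text):
    """Build a set of upper-case business terms, one per line."""
    return {line.strip().upper() for line in text.splitlines() if line.strip()}


def strip_business_terms(name, terms):
    """Drop every whitespace-separated token that is a listed business term."""
    kept = [token for token in name.split() if token.upper() not in terms]
    return " ".join(kept)


TERMS = load_terms("ACCOUNTANT\nACCT\nADMIN\nADMINISTRATOR\n")
print(strip_business_terms("John Smith Accountant", TERMS))  # John Smith
```

Because matching is on whole tokens, short terms in the list are removed wherever they appear as a separate word, which is worth keeping in mind when adding terms that could also be part of a person's name.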