Data Profiles and Semantic Recommendations

After you create a data set, the system profiles it at the column level and produces a set of semantic recommendations to repair or enrich your data. Each recommendation is based on a semantic type that the system detects automatically during the profile step.

Semantic types fall into several categories: geographic locations, identified by values such as city names; specific patterns, such as a credit card number or email address; specific data types, such as a date; and recurring patterns in the data, such as a hyphenated phrase.

Semantic Type Categories

Profiling is applied to various semantic types.

Columns are profiled to identify the following semantic type categories:

  • Geographic locations such as city names.
  • Patterns such as those found in credit card numbers or email addresses.
  • Recurring patterns such as hyphenated phrase data.

Semantic Type Recommendations

Recommendations to repair, enhance, or enrich the data set are determined by the type of data.

Examples of semantic type recommendations:

  • Enrichments - Adding a new column to your data that corresponds to a specific detected type, such as a geographic location. For example, adding population data for a city.
  • Column Concatenations - When the system detects two columns in the data set, one containing first names and the other containing last names, it recommends concatenating the names into a single column, for example, a first_name_last_name column.
  • Semantic Extractions - When a semantic type is composed of subtypes, for example, a us_phone value that includes an area code, the system recommends extracting the subtype into its own column.
  • Part Extractions - When a generic pattern separator is detected in the data, the system recommends extracting parts of that pattern. For example, if the system detects repeated hyphenation in the data, it recommends extracting the parts into separate columns, which can make the data more useful for analysis.
  • Date Extractions - When dates are detected, the system recommends extracting parts of the date that might augment the analysis of the data. For example, you might extract the day of week from an invoice or purchase date.
  • Full and Partial Obfuscation/Masking/Delete - When sensitive fields, such as a credit card number, are detected, the system recommends fully or partially masking the column, or even removing it.
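
Three of the recommendations above (part extraction, date extraction, and partial masking) can be sketched in a few lines of Python. This is only an illustration of the kind of transformation each recommendation describes, not the service's implementation. The function names extract_parts, day_of_week, and mask_partial and the sample values are hypothetical.

```python
from datetime import date

# Fixed English names so the result does not depend on the system locale.
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def extract_parts(value: str, separator: str = "-") -> list:
    """Part extraction: split a value on a recurring separator."""
    return value.split(separator)


def day_of_week(iso_date: str) -> str:
    """Date extraction: derive the weekday name from a YYYY-MM-DD string."""
    return WEEKDAYS[date.fromisoformat(iso_date).weekday()]


def mask_partial(value: str, visible: int = 4, mask_char: str = "*") -> str:
    """Partial masking: hide all but the last `visible` characters."""
    if visible <= 0:
        return mask_char * len(value)  # full masking
    hidden = max(len(value) - visible, 0)
    return mask_char * hidden + value[-visible:]


if __name__ == "__main__":
    print(extract_parts("AB-123-XYZ"))        # three separate parts
    print(day_of_week("2024-03-15"))          # weekday of an invoice date
    print(mask_partial("4111111111111111"))   # only the last four digits stay visible
```

Each function returns a value for a new or replacement column, which mirrors how a recommendation either adds a column (extractions) or rewrites one (masking).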

Recognized Pattern-Based Semantic Types

Semantic types are identified based on patterns found in the data.

Recommendations are provided for these semantic types:

  • Dates (in more than 30 formats)
  • US Social Security Numbers (SSN)
  • Credit Card Numbers
  • Credit Card Attributes (CVV and Expiration Date)
  • Email Addresses
  • North American Numbering Plan Phone Numbers
  • First Names (typical first names in the United States)
  • Last Names (typical surnames in the United States)
  • US Addresses
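
To make pattern-based detection concrete, the following Python sketch tests a single value against a few of the types listed above. It is an assumption-laden illustration: the regular expressions are deliberately simplified, the Luhn checksum is a common way to validate card numbers but the source does not say the service uses it, and the names detect_type and luhn_valid are hypothetical.

```python
import re
from typing import Optional

# Simplified, illustrative patterns; real-world formats have more variation.
SSN_RE = re.compile(r"\d{3}-\d{2}-\d{4}")
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def luhn_valid(number: str) -> bool:
    """Return True if the digits pass the Luhn checksum (13 to 19 digits)."""
    compact = number.replace(" ", "").replace("-", "")
    if not compact.isdigit() or not 13 <= len(compact) <= 19:
        return False
    total = 0
    for i, ch in enumerate(reversed(compact)):
        d = int(ch)
        if i % 2 == 1:       # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def detect_type(value: str) -> Optional[str]:
    """Return a hypothetical semantic type label for one value, or None."""
    value = value.strip()
    if SSN_RE.fullmatch(value):
        return "us_ssn"
    if EMAIL_RE.fullmatch(value):
        return "email"
    if luhn_valid(value):
        return "credit_card"
    return None
```

A detector like this classifies individual values; deciding the type of a whole column is a separate step that depends on how many values match.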

Reference-Based Semantic Types

Recognition of semantic types is determined by loaded reference knowledge provided with the service.

Reference-based recommendations are provided for these semantic types:

  • Country names
  • Country codes
  • State names (Provinces)
  • State codes
  • County names (Jurisdictions)
  • City names (Localized Names)
  • Zip codes
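
Reference-based recognition amounts to looking up a value in known lists rather than matching a pattern. The Python sketch below shows the idea with a tiny, hypothetical reference table; the REFERENCE contents, the type labels, and the function name matching_types are assumptions for illustration and do not reflect the reference knowledge actually shipped with the service.

```python
# Hypothetical, heavily truncated reference knowledge.
REFERENCE = {
    "country_name": {"canada", "mexico", "united states"},
    "state_code": {"ny", "tx", "wa"},
}


def matching_types(value: str) -> list:
    """Return every reference-based type whose list contains the value."""
    normalized = value.strip().lower()
    return sorted(
        type_name
        for type_name, known_values in REFERENCE.items()
        if normalized in known_values
    )
```

Returning a list rather than a single label leaves room for values that appear in more than one reference list.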

Recommended Enrichments

Recommended enrichments are based on the semantic types.

Enrichments are determined based on the geographic location hierarchy:

  • Country
  • Province (State)
  • Jurisdiction (County)
  • Longitude
  • Latitude
  • Population
  • Elevation (in Meters)
  • Time zone
  • ISO country codes
  • Federal Information Processing Series (FIPS)
  • Country name
  • Capital
  • Continent
  • GeoNames ID
  • Languages spoken
  • Phone country code
  • Postal code format
  • Postal code pattern
  • Currency name
  • Currency abbreviation
  • Geographic top-level domain (GeoTLD)
  • Area (in Square Kilometers)

Required Thresholds

The profiling process uses specific thresholds to make decisions about specific semantic types.

As a general rule, at least 85% of the data values in a column must meet the criteria for a single semantic type for the system to classify the column as that type. For example, a column that contains 70% first names and 30% “other” values doesn't meet the threshold, so no recommendations are made.
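
The threshold rule can be sketched as a ratio check over a column. The Python below is a minimal illustration under stated assumptions: the 85% figure comes from the text above, while the function name classify_column, the detectors mapping, and the sample first-name list are hypothetical.

```python
from typing import Callable, Dict, Iterable, Optional

THRESHOLD = 0.85  # general rule stated in the text


def classify_column(
    values: Iterable[str],
    detectors: Dict[str, Callable[[str], bool]],
    threshold: float = THRESHOLD,
) -> Optional[str]:
    """Return the type matched by at least `threshold` of the values, else None."""
    values = list(values)
    if not values:
        return None
    best_type, best_ratio = None, 0.0
    for type_name, is_match in detectors.items():
        ratio = sum(1 for v in values if is_match(v)) / len(values)
        if ratio > best_ratio:
            best_type, best_ratio = type_name, ratio
    return best_type if best_ratio >= threshold else None


if __name__ == "__main__":
    # Hypothetical first-name detector backed by a tiny sample list.
    first_names = {"alice", "bob", "carol"}
    detectors = {"first_name": lambda v: v.lower() in first_names}

    mostly_names = ["Alice"] * 9 + ["n/a"]        # 90% match
    mixed = ["Alice"] * 7 + ["n/a"] * 3           # 70% match
    print(classify_column(mostly_names, detectors))
    print(classify_column(mixed, detectors))
```

With these inputs the 90% column is classified as first_name, while the 70% column returns None, which corresponds to the case where no recommendations are made.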