Data Profiles and Semantic Recommendations

When you create a dataset, Oracle Analytics performs column-level profiling to produce a set of semantic recommendations to repair or enrich your data. When you create workbooks, you can also include knowledge enrichments in your visualizations by adding them from the Data Panel.

These recommendations are based on the system automatically detecting a specific semantic type during the profile step. The profile step might work on a sample of the data rather than on every row. For example, datasets based on local subject areas are profiled using a simple Top N sample.

Semantic types fall into categories such as geographic locations (identified, for example, by city names), recognizable patterns (such as credit card numbers, email addresses, and social security numbers), dates, and recurring patterns. You can also create your own custom semantic types.

Semantic Type Categories

Profiling is applied to various semantic types.

Semantic type categories are profiled to identify:

  • Geographic locations such as city names.
  • Patterns such as those found with credit card numbers or email addresses.
  • Recurring patterns such as hyphenated phrase data.

Semantic Type Recommendations

Recommendations to repair, enhance, or enrich the dataset are determined by the type of data.

Examples of semantic type recommendations:

  • Enrichments - Adding a new column to your data that corresponds to a specific detected type, such as a geographic location. For example, adding population data for a city.
  • Column Concatenations - When two columns are detected in the dataset, one containing first names and the other containing last names, the system recommends concatenating the names into a single column. For example, a first_name_last_name column.
  • Semantic Extractions - When a semantic type is composed of subtypes, for example, a us_phone number that includes an area code, the system recommends extracting the subtype into its own column.
  • Part Extraction - When a generic pattern separator is detected in the data, the system recommends extracting parts of that pattern. For example, if the system detects repeated hyphenation in the data, it recommends extracting the parts into separate columns to potentially make the data more useful for analysis.
  • Date Extractions - When dates are detected, the system recommends extracting parts of the date that might augment the analysis of the data. For example, you might extract the day of week from an invoice or purchase date.
  • Full and Partial Obfuscation/Masking/Delete - When sensitive fields are detected such as a credit card number, the system recommends a full or partial masking of the column, or even removal.
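To make the recommendation types above more concrete, the following Python sketch shows the kind of result each transformation produces. It is only an illustration written outside Oracle Analytics: the function names (concatenate, extract_parts, day_of_week, mask_partial) are hypothetical, and the code does not represent how the service implements its recommendations.

```python
from datetime import date

# Fixed English day names, so the result does not depend on the system locale.
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def concatenate(first_name: str, last_name: str) -> str:
    """Column concatenation: join a first name and a last name into one value."""
    return f"{first_name} {last_name}"


def extract_parts(value: str, separator: str = "-") -> list:
    """Part extraction: split a value on a repeating separator."""
    return value.split(separator)


def day_of_week(d: date) -> str:
    """Date extraction: derive the day of the week from a date."""
    return DAY_NAMES[d.weekday()]


def mask_partial(value: str, visible: int = 4, mask_char: str = "*") -> str:
    """Partial masking: hide everything except the last `visible` characters."""
    hidden = max(len(value) - visible, 0)
    return mask_char * hidden + value[hidden:]


if __name__ == "__main__":
    print(concatenate("Ada", "Lovelace"))
    print(extract_parts("NA-2024-0042"))
    print(day_of_week(date(2024, 1, 15)))
    print(mask_partial("4111111111111111"))
```

Calling mask_partial with visible=0 masks the whole value, which corresponds to full obfuscation in this sketch.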

Recognized Pattern-Based Semantic Types

Semantic types are identified based on patterns found in your data.

Recommendations are provided for these semantic types:

  • Dates (in more than 30 formats)
  • US Social Security Numbers (SSN)
  • Credit Card Numbers
  • Credit Card Attributes (CVV and Expiration Date)
  • Email Addresses
  • North American Numbering Plan (NANP) Phone Numbers
  • US Addresses
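As a rough illustration of pattern-based detection, the sketch below tests single values against a few regular expressions. The patterns are deliberately simplified assumptions, the type names and the detect_type function are hypothetical, and none of this reflects the actual rules used by the Oracle Analytics profiler.

```python
import re
from typing import Optional

# Simplified, illustrative patterns (assumptions, not the service's rules).
PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "us_ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "nanp_phone": re.compile(r"^\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}$"),
}


def detect_type(value: str) -> Optional[str]:
    """Return the name of the first pattern the value matches, or None."""
    candidate = value.strip()
    for name, pattern in PATTERNS.items():
        if pattern.match(candidate):
            return name
    return None


if __name__ == "__main__":
    for sample in ["jane@example.com", "123-45-6789", "(555) 123-4567", "hello"]:
        print(sample, "->", detect_type(sample))
```

A production profiler would need stricter validation than this, for example a checksum test for credit card numbers, which a plain regular expression cannot express.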

Reference-Based Semantic Types

Recognition of semantic types is determined by loaded reference knowledge provided with the service.

Reference-based recommendations are provided for these semantic types:

  • Country names
  • Country codes
  • State names (Provinces)
  • State codes
  • County names (Jurisdictions)
  • City names (Localized Names)
  • Zip codes

Recommended Enrichments

Recommended enrichments are based on the semantic types.

Enrichments are determined based on the geographic location hierarchy:

  • Country
  • Province (State)
  • Jurisdiction (County)
  • Longitude
  • Latitude
  • Population
  • Elevation (in Meters)
  • Time zone
  • ISO country codes
  • Federal Information Processing Series (FIPS)
  • Country name
  • Capital
  • Continent
  • GeoNames ID
  • Languages spoken
  • Phone country code
  • Postal code format
  • Postal code pattern
  • Currency name
  • Currency abbreviation
  • Geographic top-level domain (GeoTLD)
  • Area (in Square Kilometers)

Required Thresholds

The profiling process uses specific thresholds to make decisions about specific semantic types.

As a general rule, 85% of the data values in a column must meet the criteria for a single semantic type before the system classifies the column as that type. For example, a column that contains 70% first names and 30% “other” values doesn't meet the threshold, so no recommendations are made.
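The threshold rule can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the classify_column function and the toy looks_like_email detector are hypothetical, empty values are ignored when computing the ratio (the source does not say how the service treats them), and only the 85% figure comes from the text above.

```python
from collections import Counter
from typing import Callable, Iterable, Optional

THRESHOLD = 0.85  # general rule stated in the text


def looks_like_email(value: str) -> Optional[str]:
    """Toy detector used only for this example."""
    return "email" if "@" in value else None


def classify_column(
    values: Iterable[str],
    detect: Callable[[str], Optional[str]],
    threshold: float = THRESHOLD,
) -> Optional[str]:
    """Return a semantic type only if enough values share that single type."""
    non_empty = [v for v in values if v]  # assumption: empty values are skipped
    if not non_empty:
        return None
    counts = Counter(detect(v) for v in non_empty)
    counts.pop(None, None)  # unrecognized values count against the ratio
    if not counts:
        return None
    best_type, hits = counts.most_common(1)[0]
    return best_type if hits / len(non_empty) >= threshold else None


if __name__ == "__main__":
    mostly_email = ["user%d@example.com" % i for i in range(9)] + ["n/a"]
    mixed = ["user%d@example.com" % i for i in range(7)] + ["n/a"] * 3
    print(classify_column(mostly_email, looks_like_email))  # 90% match
    print(classify_column(mixed, looks_like_email))         # 70% match
```

In the first call 9 of 10 values match, so the column is classified; in the second only 7 of 10 match, which mirrors the 70%/30% case described above and yields no classification.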

Custom Knowledge Recommendations

Use custom knowledge recommendations to augment the Oracle Analytics system knowledge. Custom knowledge enables the Oracle Analytics semantic profiler to identify more business-specific semantic types and make more relevant and governed enrichment recommendations. For example, you might add a custom knowledge reference that classifies prescription medications into USP drug categories such as Analgesics or Opioid.

You can use existing semantic files such as Unsupervised Semantic Parsing (USP) files, or you can create your own semantic files. Ask your administrator to upload custom knowledge files to Oracle Analytics. When you enrich datasets, Oracle Analytics presents enrichment recommendations based on this semantic data. When you create workbooks, you can also include knowledge enrichments in your visualizations by adding them from the Data Panel.

Creating Your Own Custom Knowledge Files

When you create your own semantic files, follow these guidelines:

  • Create a data file in CSV or Microsoft Excel (XLSX) format.
  • Populate the first column with the key, which Oracle Analytics uses to profile the data.
  • Populate the other columns with the enrichment values.
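The guidelines above can be sketched as a small Python script that writes a CSV file with the key in the first column and enrichment values in the remaining columns. Everything specific here is an assumption made for illustration: the column names, the placeholder rows, the presence of a header row, and the check that keys are unique are not requirements stated in this topic.

```python
import csv

# Hypothetical layout: first column is the key, the others are enrichment values.
HEADER = ["medication", "usp_category", "notes"]
ROWS = [
    ["drug_a", "Analgesics", "placeholder note 1"],
    ["drug_b", "Analgesics", "placeholder note 2"],
    ["drug_c", "Opioid", "placeholder note 3"],
]


def write_knowledge_file(path: str, header: list, rows: list) -> None:
    """Write a custom knowledge CSV after two basic sanity checks."""
    keys = [row[0] for row in rows]
    if len(set(keys)) != len(keys):
        # Assumption: each key should map to exactly one set of enrichment values.
        raise ValueError("keys in the first column must be unique")
    if any(len(row) != len(header) for row in rows):
        raise ValueError("every row must have one value per column")
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)


if __name__ == "__main__":
    write_knowledge_file("custom_knowledge.csv", HEADER, ROWS)
```

The same layout applies to an XLSX file; CSV is used here only because it can be written with the Python standard library.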

Ask your administrator to upload your custom knowledge file to Oracle Analytics.