Data Profiles and Semantic Recommendations

When you create a dataset, Oracle Analytics performs column-level profiling to produce a set of semantic recommendations to repair or enrich your data. When you create workbooks, you can also include knowledge enrichments in your visualizations by adding them from the Data Panel.

Note:

Knowledge enrichments are usually enabled by default, but workbook editors can enable or disable them for datasets that they own or have editing privileges for. Oracle Analytics doesn't automatically provide enrichment recommendations for datasets generated from a data flow. In this case, the dataset owner or administrator must first enable the knowledge enrichments option for the dataset. See Enable Knowledge Enrichments for Datasets.

Recommendations are generated when the system automatically detects a specific semantic type during the profile step. Profiling runs on a sample of the data. For example, datasets based on local subject areas are profiled using a simple Top N sample.

Semantic types fall into several categories: geographic locations, identified by values such as city names; recognizable patterns, such as credit card numbers, email addresses, and social security numbers; dates; and recurring patterns. You can also create your own custom semantic types.


Semantic Type Categories

Profiling is applied to various semantic types.

Semantic type categories are profiled to identify:

Geographic locations such as city names.
Patterns such as those found in credit card numbers or email addresses.
Recurring patterns such as hyphenated phrase data.

Semantic Type Recommendations

Recommendations to repair, enhance, or enrich the dataset are determined by the type of data.

Examples of semantic type recommendations:

Enrichments - Adding a new column to your data that corresponds to a specific detected type, such as a geographic location. For example, adding population data for a city.
Column Concatenations - When two columns are detected in the dataset, one containing first names and the other containing last names, the system recommends concatenating the names into a single column. For example, a first_name_last_name column.
Semantic Extractions - When a semantic type is composed of subtypes, for example a us_phone number that includes an area code, the system recommends extracting the subtype into its own column.
Part Extraction - When a generic pattern separator is detected in the data, the system recommends extracting parts of that pattern. For example, if the system detects repeated hyphenation in the data, it recommends extracting the parts into separate columns, which can make the data more useful for analysis.
Date Extractions - When dates are detected, the system recommends extracting parts of the date that might augment the analysis of the data. For example, you might extract the day of week from an invoice or purchase date.
Full and Partial Obfuscation/Masking/Delete - When sensitive fields such as credit card numbers are detected, the system recommends fully or partially masking the column, or removing it altogether.
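
The recommendation types above can be illustrated with a minimal Python sketch. This is not how Oracle Analytics implements them; the function names, the space used to join names, and the choice to keep the last four characters visible are all assumptions made for illustration.

```python
from datetime import date

# Fixed English day names so the result does not depend on the locale.
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def concatenate_names(first: str, last: str) -> str:
    """Column concatenation: join a first and last name into one value."""
    return f"{first} {last}"


def extract_parts(value: str, separator: str = "-") -> list:
    """Part extraction: split a value on a repeated separator."""
    return value.split(separator)


def extract_day_of_week(d: date) -> str:
    """Date extraction: derive the day name from a date."""
    return DAY_NAMES[d.weekday()]


def mask_partial(value: str, visible: int = 4, mask_char: str = "*") -> str:
    """Partial masking: hide everything except the last `visible` characters."""
    if visible <= 0:
        return mask_char * len(value)  # full masking
    hidden = max(len(value) - visible, 0)
    return mask_char * hidden + value[hidden:]
```

Each function maps one input value to one output value, so applying it to every row of a column yields the new or replacement column that a recommendation would produce.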

Recognized Pattern-Based Semantic Types

Semantic types are identified based on patterns found in your data.

Recommendations are provided for these semantic types:

Dates (in more than 30 formats)
US Social Security Numbers (SSN)
Credit Card Numbers
Credit Card Attributes (CVV and Expiration Date)
Email Addresses
North American Numbering Plan (NANP) Phone Numbers
US Addresses
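
As a rough illustration of pattern-based detection, the sketch below classifies a single value with regular expressions and a Luhn checksum. The expressions, the labels, and the order of the checks are assumptions chosen for this example. The rules Oracle Analytics actually uses are not described here.

```python
import re
from typing import Optional

# Simplified, hypothetical patterns; real-world validation is stricter.
SSN_RE = re.compile(r"\d{3}-\d{2}-\d{4}")
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")
CARD_RE = re.compile(r"[\d -]{13,23}")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    if not 13 <= len(digits) <= 19:
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def detect_type(value: str) -> Optional[str]:
    """Return a semantic type label for one value, or None if nothing matches."""
    v = value.strip()
    if SSN_RE.fullmatch(v):
        return "ssn"
    if EMAIL_RE.fullmatch(v):
        return "email"
    if CARD_RE.fullmatch(v) and luhn_valid(v):
        return "credit_card"
    return None
```

For example, detect_type("user@example.com") returns "email", while a 16-digit string that fails the checksum returns None.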

Reference-Based Semantic Types

Semantic types are recognized by matching values against the reference knowledge that is loaded with the service.

Reference-based recommendations are provided for these semantic types:

Country names
Country codes
State names (Provinces)
State codes
County names (Jurisdictions)
City names (Localized Names)
Zip codes

Recommended Enrichments

Recommended enrichments are based on the semantic types.

Enrichments are determined based on the geographic location hierarchy:

Country
Province (State)
Jurisdiction (County)
Longitude
Latitude
Population
Elevation (in Meters)
Time zone
ISO country codes
Federal Information Processing Series (FIPS)
Country name
Capital
Continent
GeoNames ID
Languages spoken
Phone country code
Postal code format
Postal code pattern
Currency name
Currency abbreviation
Geographic top-level domain (GeoTLD)
Area (in square kilometers)

Required Thresholds

The profiling process uses thresholds to decide whether a column matches a specific semantic type.

As a general rule, 85% of the data values in the column must meet the criteria for a single semantic type in order for Oracle Analytics to determine the classification. For example, a column with 70% first names and 30% “other” doesn't meet the threshold requirements and therefore no recommendations are made.
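
The threshold rule can be sketched as follows. The function name classify_column and the detect callback are hypothetical, and counting unmatched values (including blanks) against the threshold is an assumption. The text above states only the 85% general rule.

```python
from collections import Counter
from typing import Callable, Iterable, Optional


def classify_column(
    values: Iterable[str],
    detect: Callable[[str], Optional[str]],
    threshold: float = 0.85,
) -> Optional[str]:
    """Return the single semantic type that meets the threshold, else None.

    `detect` maps one value to a type label, or None when nothing matches.
    """
    labels = [detect(v) for v in values]
    if not labels:
        return None
    counts = Counter(label for label in labels if label is not None)
    if not counts:
        return None
    label, count = counts.most_common(1)[0]
    # The most frequent type must cover at least `threshold` of all values.
    return label if count / len(labels) >= threshold else None
```

With this sketch, a column in which 70% of the values are detected as first names returns None, so no recommendation would be made, while a column at 90% returns the first-name label.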

Custom Knowledge Recommendations

Use custom knowledge recommendations to augment the Oracle Analytics system knowledge. Custom knowledge enables the Oracle Analytics semantic profiler to identify more business-specific semantic types and to make more relevant and governed enrichment recommendations. For example, you might add a custom knowledge reference that classifies prescription medications into USP drug categories such as Analgesics or Opioid.
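
Conceptually, a custom knowledge reference behaves like a lookup table that maps a key value to one or more enrichment values. The sketch below models that idea with a small CSV. The layout (key in the first column, enrichment values in the remaining columns), the column names, and the placeholder medication names med_alpha and med_beta are assumptions for illustration, not real reference data or an Oracle Analytics file specification.

```python
import csv
import io

# Hypothetical reference data: the first column is the key, the rest are enrichments.
SAMPLE_REFERENCE = """medication,usp_category
med_alpha,Analgesics
med_beta,Opioid
"""


def load_reference(text: str) -> dict:
    """Parse CSV text into {normalized key: {enrichment column: value}}."""
    reader = csv.DictReader(io.StringIO(text))
    key_field, *value_fields = reader.fieldnames
    return {
        row[key_field].strip().lower(): {f: row[f] for f in value_fields}
        for row in reader
    }


def enrich(values, reference: dict) -> list:
    """Look up each value; unmatched values yield None."""
    return [reference.get(v.strip().lower()) for v in values]
```

Calling enrich(["Med_Alpha", "unknown"], load_reference(SAMPLE_REFERENCE)) returns the Analgesics entry for the first value and None for the second, which mirrors how a matching key would drive an enrichment recommendation.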
