About the Data Enrichment modules

The Data Enrichment modules increase the usability of your data by discovering value in its content.

Bundled in the Data Enrichment package is a collection of modules along with the logic to associate these modules with a column of data (for example, an address column can be detected and associated with a GeoTagger module).

During the sampling phase of the Data Processing workflow, some of the Data Enrichment modules run automatically while others do not. If you run a workflow with the DP CLI, you can use the --excludePlugins flag to specify which modules should not be run.

After a data set has been created, you can run any module from Studio's Transform page.

Pre-screening of input

When Data Processing is running against a Hive table, the Data Enrichment modules that run automatically obtain their input pre-screened by the sampling stage. For example, only an IP address is ever passed to the IP Address GeoTagger module.
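The idea of pre-screening can be illustrated with a minimal Python sketch. This is not the product's implementation: the function names `looks_like_ip` and `prescreen` are hypothetical, and the sketch assumes that "is an IP address" means "parses with Python's standard `ipaddress` module".

```python
import ipaddress


def looks_like_ip(value: str) -> bool:
    """Return True if the value parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(value.strip())
    except ValueError:
        return False
    return True


def prescreen(values):
    """Keep only the sampled values that are IP addresses (hypothetical pre-screen)."""
    return [v for v in values if isinstance(v, str) and looks_like_ip(v)]


# Made-up sample values for illustration only.
sample = ["203.0.113.7", "hello", "10.0.0.1", "", "2001:db8::1"]
screened = prescreen(sample)
print(screened)
```

In this sketch, anything that does not parse as an address is dropped before a module would see it, which mirrors the behavior described above for the IP Address GeoTagger.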

Attributes that are ignored

All Data Enrichment modules ignore both the primary-key attribute of a record and any attribute whose data type is inappropriate for that module. For example, the Entity extractor works only on string attributes, so numeric attributes are ignored. In addition, multi-assign attributes are ignored for auto-enrichment.
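The three rules above can be sketched as a simple filter. The `Attribute` class, its field names, and the `eligible_attributes` function are hypothetical and exist only for this illustration; they are not part of any Data Processing API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attribute:
    """Hypothetical description of a data set attribute."""
    name: str
    data_type: str
    is_primary_key: bool = False
    is_multi_assign: bool = False


def eligible_attributes(attributes, accepted_types, auto_enrichment=True):
    """Return the attributes a module would consider under the rules above."""
    result = []
    for attr in attributes:
        if attr.is_primary_key:
            continue  # the primary-key attribute is always ignored
        if attr.data_type not in accepted_types:
            continue  # data type is inappropriate for the module
        if auto_enrichment and attr.is_multi_assign:
            continue  # multi-assign attributes are skipped for auto-enrichment
        result.append(attr)
    return result


# Made-up attributes for illustration only.
attrs = [
    Attribute("id", "string", is_primary_key=True),
    Attribute("review", "string"),
    Attribute("price", "double"),
    Attribute("tags", "string", is_multi_assign=True),
]
auto_names = [a.name for a in eligible_attributes(attrs, {"string"})]
manual_names = [a.name for a in eligible_attributes(attrs, {"string"}, auto_enrichment=False)]
print(auto_names, manual_names)
```

For a string-only module such as the Entity extractor, only `review` survives the filter during auto-enrichment in this sketch, while `tags` is also considered when the multi-assign rule is not applied.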

Sampling strategy for the modules

When Data Processing runs (for example, during a full data ingest), each module runs only under the following conditions during the sampling phase:
  • Entity: never runs automatically.
  • TF-IDF: runs only if the text contains between 35 and 30,000 tokens.
  • Sentiment Analysis (both document level and sub-document level): never runs automatically.
  • Address GeoTagger: runs only on well-formed addresses. Note that the GeoTagger sub-modules (City/Region/Sub-Region/Country) never run automatically.
  • IP Address GeoTagger: runs only on IPv4 addresses (it does not run on private IP addresses and does not run automatically on IPv6 addresses).
  • Reverse GeoTagger: runs only on valid geocode formats.
  • Boilerplate Removal: never runs automatically.
  • Tag Stripper: never runs automatically.
  • Phonetic Hash: never runs automatically.
  • Language Detection: runs only if the input text is at least 30 words long. The module is enabled for input text in the range of 30 to 30,000 tokens.

Note that when the Data Processing workflow finishes, you can manually run any of these modules from Transform in Studio.
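Some of the conditions in the list above can be expressed as a small decision function. The following Python sketch is an illustration only, under stated assumptions: tokens are approximated by whitespace-separated words, the token ranges are treated as inclusive, and "private" follows the definition used by Python's `ipaddress` module. The names `runs_automatically`, `NEVER_AUTOMATIC`, and `TOKEN_RANGES` are hypothetical. The Address GeoTagger and Reverse GeoTagger checks are not modeled because the text does not define what counts as a well-formed address or a valid geocode.

```python
import ipaddress

# Modules that the list above says never run automatically.
NEVER_AUTOMATIC = {
    "Entity",
    "Sentiment Analysis",
    "Boilerplate Removal",
    "Tag Stripper",
    "Phonetic Hash",
}

# Token ranges from the list above; inclusive bounds are an assumption.
TOKEN_RANGES = {
    "TF-IDF": (35, 30_000),
    "Language Detection": (30, 30_000),
}


def token_count(text: str) -> int:
    """Approximate tokens as whitespace-separated words (an assumption)."""
    return len(text.split())


def runs_automatically(module: str, value: str) -> bool:
    """Decide whether a module would run on a sampled value in this sketch."""
    if module in NEVER_AUTOMATIC:
        return False
    if module in TOKEN_RANGES:
        low, high = TOKEN_RANGES[module]
        return low <= token_count(value) <= high
    if module == "IP Address GeoTagger":
        try:
            addr = ipaddress.ip_address(value.strip())
        except ValueError:
            return False
        # Only public IPv4 addresses qualify; IPv6 and private ranges do not.
        return addr.version == 4 and not addr.is_private
    raise ValueError(f"module not modeled in this sketch: {module}")


print(runs_automatically("TF-IDF", "word " * 40))
print(runs_automatically("IP Address GeoTagger", "192.168.0.1"))
```

Keeping the thresholds in a table rather than in the function body makes it easy to compare the sketch against the list.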

Supported languages

The supported languages are specific to each module. For details, see the topic for the module.

Output attribute names

The types and names of output attributes are specific to each module. For details on output attributes, see the topic for the module.

Data Enrichment logging

If Data Enrichment modules are run in a workflow, they are logged as part of the YARN log. The log entries describe which modules were run and the columns (attributes) created by the modules.

For example, a data set that contains many geocode values can produce the following log entries:
Running enrichments (if any)..
generate plugin recommendations and auto enrich transform script
TOTAL AVAILABLE PLUGINS: 12
SampleValuedRecommender::Registering Plugin: AddressGeoTaggerUDF
SampleValuedRecommender::Registering Plugin: IPGeoExtractorUDF
SampleValuedRecommender::Registering Plugin: ReverseGeoTaggerUDF
SampleValuedRecommender::Registering Plugin: LanguageDetectionUDF
SampleValuedRecommender::Registering Plugin: DocLevelSentimentAnalysisUDF
SampleValuedRecommender::Registering Plugin: BoilerPlateRemovalUDF
SampleValuedRecommender::Registering Plugin: TagStripperUDF
SampleValuedRecommender::Registering Plugin: TFIDFTermExtractorUDF
SampleValuedRecommender::Registering Plugin: EntityExtractionUDF
SampleValuedRecommender::Registering Plugin: SubDocLevelSentimentAnalysisUDF
SampleValuedRecommender::Registering Plugin: PhoneticHashUDF
SampleValuedRecommender::Registering Plugin: StructuredAddressGeoTaggerUDF
valid input string count=0, total input string count=101, success ratio=0.0
AddressGeotagger won't be invoked since the success ratio is < 80%
SampleValuedRecommender: --- [ReverseGeoTaggerUDF] plugin RECOMMENDS column: [latlong] for Enrichment, based on 101 samples
SampleValuedRecommender: --- new enriched column 'latlong_geo_city' will be created from 'latlong'
SampleValuedRecommender: --- new enriched column 'latlong_geo_country' will be created from 'latlong'
SampleValuedRecommender: --- new enriched column 'latlong_geo_postcode' will be created from 'latlong'
SampleValuedRecommender: --- new enriched column 'latlong_geo_region' will be created from 'latlong'
SampleValuedRecommender: --- new enriched column 'latlong_geo_subregion' will be created from 'latlong'
SampleValuedRecommender: --- new enriched column 'latlong_geo_regionid' will be created from 'latlong'
SampleValuedRecommender: --- new enriched column 'latlong_geo_subregionid' will be created from 'latlong'

In the example, the Reverse GeoTagger created seven columns.
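If you need to summarize such entries, the log lines can be parsed with regular expressions. The following Python sketch is based only on the line formats shown in the example above; the `summarize` function and the pattern names are hypothetical, and the patterns may need adjusting if the log format differs in your release.

```python
import re
from collections import defaultdict

# Patterns derived from the example log entries shown above.
RECOMMEND_RE = re.compile(
    r"\[(?P<plugin>\w+)\] plugin RECOMMENDS column: \[(?P<column>[^\]]+)\]"
)
NEW_COLUMN_RE = re.compile(
    r"new enriched column '(?P<new>[^']+)' will be created from '(?P<source>[^']+)'"
)


def summarize(log_lines):
    """Return (recommended plugin per column, new columns per source column)."""
    recommendations = {}
    created = defaultdict(list)
    for line in log_lines:
        match = RECOMMEND_RE.search(line)
        if match:
            recommendations[match["column"]] = match["plugin"]
            continue
        match = NEW_COLUMN_RE.search(line)
        if match:
            created[match["source"]].append(match["new"])
    return recommendations, dict(created)


# A few lines copied from the example log above.
log = """\
AddressGeotagger won't be invoked since the success ratio is < 80%
SampleValuedRecommender: --- [ReverseGeoTaggerUDF] plugin RECOMMENDS column: [latlong] for Enrichment, based on 101 samples
SampleValuedRecommender: --- new enriched column 'latlong_geo_city' will be created from 'latlong'
SampleValuedRecommender: --- new enriched column 'latlong_geo_country' will be created from 'latlong'
""".splitlines()

recommendations, created = summarize(log)
print(recommendations)
print(created)
```

Lines that match neither pattern, such as the plugin registration entries, are skipped.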