The Data Enrichment modules increase the usability of your data by discovering value in its content.
Bundled in the Data Enrichment package is a collection of modules along with the logic to associate these modules with a column of data (for example, an address column can be detected and associated with a GeoTagger module).
During the sampling phase of the Data Processing workflow, some of the Data Enrichment modules run automatically while others do not. If you run a workflow with the DP CLI, you can use the --excludePlugins flag to specify which modules should not be run.
After a data set has been created, you can run any module from Studio's Transform page.
Pre-screening of input
When Data Processing is running against a Hive table, the Data Enrichment modules that run automatically obtain their input pre-screened by the sampling stage. For example, only an IP address is ever passed to the IP Address GeoTagger module.
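The idea of pre-screening can be illustrated with a minimal Python sketch. This is not the Data Processing implementation; the function names `is_ip_address` and `prescreen` are hypothetical, and the sketch simply assumes that screening means discarding any sampled value that does not parse as an IPv4 or IPv6 address.

```python
import ipaddress

def is_ip_address(value):
    """Return True if the string parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(value.strip())
        return True
    except ValueError:
        return False

def prescreen(values):
    """Keep only the values that a hypothetical IP-based module could use."""
    return [v for v in values if is_ip_address(v)]

# Hypothetical sampled column values (documentation-range addresses).
samples = ["192.0.2.17", "not an address", "2001:db8::1", ""]
screened = prescreen(samples)
print(screened)
```

In this sketch the module never sees the free-text or empty values, which mirrors the behavior described above for automatically run modules.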
Attributes that are ignored
All Data Enrichment modules ignore both the primary-key attribute of a record and any attribute whose data type is inappropriate for that module. For example, the Entity extractor works only on string attributes, so numeric attributes are ignored. In addition, multi-assign attributes are ignored during auto-enrichment.
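The three rules above can be sketched as a simple filter. This is an illustration only, under the assumption that each attribute carries a name, a data type, a primary-key flag, and a multi-assign flag; the `Attribute` class, the `eligible_attributes` function, and the sample attribute names are all hypothetical and not part of the product.

```python
from dataclasses import dataclass

@dataclass
class Attribute:
    # Hypothetical description of a record attribute.
    name: str
    data_type: str
    is_primary_key: bool = False
    is_multi_assign: bool = False

def eligible_attributes(attributes, accepted_types, auto_enrichment=True):
    """Return the names of attributes a module would not ignore."""
    eligible = []
    for attr in attributes:
        if attr.is_primary_key:
            continue  # the primary-key attribute is always ignored
        if attr.data_type not in accepted_types:
            continue  # data type is inappropriate for this module
        if auto_enrichment and attr.is_multi_assign:
            continue  # multi-assign attributes are skipped for auto-enrichment
        eligible.append(attr.name)
    return eligible

attrs = [
    Attribute("record_id", "string", is_primary_key=True),
    Attribute("review_text", "string"),
    Attribute("price", "double"),
    Attribute("tags", "string", is_multi_assign=True),
]

# A string-only module, such as the Entity extractor, during auto-enrichment.
print(eligible_attributes(attrs, {"string"}))
```

Passing `auto_enrichment=False` in this sketch lets the multi-assign `tags` attribute through, reflecting that the multi-assign restriction is stated only for auto-enrichment.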
Sampling strategy for the modules
Note that when the Data Processing workflow finishes, you can manually run any of these modules from Transform in Studio.
Supported languages
The supported languages are specific to each module. For details, see the topic for the module.
Output attribute names
The types and names of output attributes are specific to each module. For details on output attributes, see the topic for the module.
Data Enrichment logging
If Data Enrichment modules are run in a workflow, they are logged as part of the YARN log. The log entries describe which module was run and the columns (attributes) that the modules created. For example:
Running enrichments (if any)..
generate plugin recommendations and auto enrich transform script
TOTAL AVAILABLE PLUGINS: 12
SampleValuedRecommender::Registering Plugin: AddressGeoTaggerUDF
SampleValuedRecommender::Registering Plugin: IPGeoExtractorUDF
SampleValuedRecommender::Registering Plugin: ReverseGeoTaggerUDF
SampleValuedRecommender::Registering Plugin: LanguageDetectionUDF
SampleValuedRecommender::Registering Plugin: DocLevelSentimentAnalysisUDF
SampleValuedRecommender::Registering Plugin: BoilerPlateRemovalUDF
SampleValuedRecommender::Registering Plugin: TagStripperUDF
SampleValuedRecommender::Registering Plugin: TFIDFTermExtractorUDF
SampleValuedRecommender::Registering Plugin: EntityExtractionUDF
SampleValuedRecommender::Registering Plugin: SubDocLevelSentimentAnalysisUDF
SampleValuedRecommender::Registering Plugin: PhoneticHashUDF
SampleValuedRecommender::Registering Plugin: StructuredAddressGeoTaggerUDF
valid input string count=0, total input string count=101, success ratio=0.0
AddressGeotagger won't be invoked since the success ratio is < 80%
SampleValuedRecommender: --- [ReverseGeoTaggerUDF] plugin RECOMMENDS column: [latlong] for Enrichment, based on 101 samples
SampleValuedRecommender: --- new enriched column 'latlong_geo_city' will be created from 'latlong'
SampleValuedRecommender: --- new enriched column 'latlong_geo_country' will be created from 'latlong'
SampleValuedRecommender: --- new enriched column 'latlong_geo_postcode' will be created from 'latlong'
SampleValuedRecommender: --- new enriched column 'latlong_geo_region' will be created from 'latlong'
SampleValuedRecommender: --- new enriched column 'latlong_geo_subregion' will be created from 'latlong'
SampleValuedRecommender: --- new enriched column 'latlong_geo_regionid' will be created from 'latlong'
SampleValuedRecommender: --- new enriched column 'latlong_geo_subregionid' will be created from 'latlong'
In the example, the Reverse GeoTagger created seven columns.
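To check which columns a workflow created without reading the log by eye, the "new enriched column" entries can be extracted with a short script. The following is a minimal sketch that assumes the log entries keep the wording shown in the example; the function name `enriched_columns` is hypothetical, and the sample text is a shortened copy of the entries above.

```python
import re
from collections import defaultdict

# Matches entries such as:
#   new enriched column 'latlong_geo_city' will be created from 'latlong'
ENRICHED_COLUMN = re.compile(
    r"new enriched column '([^']+)' will be created from '([^']+)'"
)

def enriched_columns(log_text):
    """Map each source column to the enriched columns created from it."""
    created = defaultdict(list)
    for new_col, source_col in ENRICHED_COLUMN.findall(log_text):
        created[source_col].append(new_col)
    return dict(created)

# Two entries copied from the example log.
sample_log = (
    "SampleValuedRecommender: --- new enriched column 'latlong_geo_city' "
    "will be created from 'latlong'\n"
    "SampleValuedRecommender: --- new enriched column 'latlong_geo_country' "
    "will be created from 'latlong'\n"
)

result = enriched_columns(sample_log)
for source, columns in result.items():
    print(source, "->", columns)
```

Run against the full example log, a script like this would list all seven `latlong_geo_*` columns under the `latlong` source column.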