Enrichment functions are based on Data Enrichment modules used as part of data processing in Big Data Discovery. You can use these functions to extract meaningful information from your data and modify attributes to make them more useful for analysis.
The same functions are described in the Transform API Reference (Groovydoc).
More information on the Data Enrichment modules is available in the Data Processing Guide.
Finds the language of a given String attribute and returns an Oracle language code (for example, es for Spanish). For accurate results, the text should contain at least ten words.
detectLanguage(String attribute)
where:
The results are returned in a single-assign attribute.
detectLanguage(labor_description)
might return "en" for the labor_description String attribute.
Extracts key phrases from a String attribute and returns a list of phrases in a multi-assign attribute. The function calculates key phrases using the TF/IDF algorithm, which takes the number of times each term appears within the String and offsets that value by how often the term appears within a larger body of work. Offsetting the value helps filter out frequently used terms like "the" and "it". The body of work used as the control is selected internally based on the String's language; for example, the model used for English is based on a New York Times corpus. The extractKeyPhrases function is a wrapper function for the TF/IDF Term extractor enrichment module.
The number of key phrases returned by extractKeyPhrases is a function of the TF/IDF curve. By default, it stops returning terms when the score of a given term falls below ~68%.
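To make the scoring idea concrete, here is a minimal Python sketch of TF/IDF. It is not the TF/IDF Term extractor module: the three-sentence reference corpus, the smoothed weighting formula, and the name tfidf_scores are all assumptions made for illustration, and the module's own tokenization and cutoff are not reproduced.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def tfidf_scores(text, reference_docs):
    """Score each term by its count in text, offset by how many of the
    reference documents (a stand-in control corpus) contain the term."""
    term_counts = Counter(tokenize(text))
    doc_term_sets = [set(tokenize(doc)) for doc in reference_docs]
    n_docs = len(doc_term_sets)
    scores = {}
    for term, count in term_counts.items():
        doc_freq = sum(1 for terms in doc_term_sets if term in terms)
        # Smoothed inverse document frequency (an assumed formula): a term
        # found in every reference document gets a weight of zero.
        idf = math.log((1 + n_docs) / (1 + doc_freq))
        scores[term] = count * idf
    return scores


# Hypothetical control corpus and input text.
reference = [
    "the cat sat on the mat",
    "the dog ate the bone",
    "it is the end",
]
scores = tfidf_scores("the battery died and the battery was replaced", reference)
top_term = max(scores, key=scores.get)
print(top_term, scores["the"])
```

In this sketch "the" appears in every reference document, so its score collapses to zero, while "battery" appears in none of them and ranks first.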
extractKeyPhrases(String attribute, String languageCode, Boolean smartCasing)
where:
extractKeyPhrases(toLowerCase(comments))
extractKeyPhrases(surveys, 'en', true)
When you create a new attribute as a result of using this function, make sure the attribute is of type multi-assign.
Returns a String containing noun groups. A noun group is any noun, such as "movie" or "building". This is a wrapper function for the Noun Group Extractor enrichment module. This module finds and returns noun groups from a String attribute in each of the supported languages. It is used in tag cloud visualization, for finding commonly occurring themes in the data.
extractNounGroups(String attribute, String languageCode)
where:
extractNounGroups(labor_description, 'en')
would return nouns (such as "Battery" or "Mass Air Flow Sensor") for the labor_description String attribute.
When you create a new attribute as a result of using this function, make sure the attribute is of type multi-assign.
Uses a dictionary-matching algorithm that locates elements of a finite set of Strings (the whitelist) within input text. The function finds all occurrences of any whitelist terms and returns a list of matching expansions. The whitelist is newline-delimited. This is a wrapper function for the Whitelist Tagger enrichment module.
Each line of the whitelist is either a comment (indicated by a # as the first character) or a matching directive consisting of one or two values separated by the delimiter character. When a second value is present, it replaces the matched term in the output.
It could be rewritten as follows, with a forward slash (/) as the delimiter:
When this whitelist is run on the text "The only noble gas is radon", it produces the output list ['Rn'].
extractWhitelistTags(String attribute, String whitelist, String languageCode,
boolean caseSensitive, boolean matchWholeWords, String delimiter)
where:
delimiter=','
Rn,86
Ne,10
He,2
delimiter='/'
Rn/86
Ne/10
He/2
no delimiter specified (uses the default <tab> character)
Rn<tab>86
Ne<tab>10
He<tab>2
def whitelist = '''
helium/He
neon/Ne
argon/Ar
krypton/Kr
xenon/Xe
radon/Rn
'''
def document = 'The noble gases make a group of chemical elements with similar properties: under standard conditions, they are all odorless, colorless, monatomic gases with very low chemical reactivity. The six noble gases that occur naturally are helium (He), neon (Ne), argon (Ar), krypton (Kr), xenon (Xe), and the radioactive RADON (Rn).'
extractWhitelistTags(document, whitelist, 'en', false, true, '/')
Note that the language specified is English ('en'), the matches are not case-sensitive (caseSensitive is false), and only whole words are matched (matchWholeWords is true). The '/' delimiter is used for parsing the whitelist.
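The whitelist format and matching options described for this function can be approximated with a short Python sketch. It is an illustration only, not the Whitelist Tagger module: the names parse_whitelist and extract_tags are hypothetical, language handling is left out, and returning one output value per occurrence is an assumption.

```python
import re


def parse_whitelist(whitelist, delimiter="\t"):
    """Map each match term to its output value, skipping blanks and comments."""
    entries = {}
    for line in whitelist.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        term, _, rewrite = line.partition(delimiter)
        # With a single value on the line, the term itself is the output.
        entries[term] = rewrite or term
    return entries


def extract_tags(text, whitelist, case_sensitive=False,
                 match_whole_words=True, delimiter="\t"):
    """Return the output value for every whitelist term found in text."""
    flags = 0 if case_sensitive else re.IGNORECASE
    tags = []
    for term, output in parse_whitelist(whitelist, delimiter).items():
        pattern = re.escape(term)
        if match_whole_words:
            pattern = r"\b" + pattern + r"\b"
        # Assumption: emit one output value per occurrence of the term.
        tags.extend(output for _ in re.finditer(pattern, text, flags))
    return tags


gases = "# noble gases\nhelium/He\nradon/Rn"
print(extract_tags("The only noble gas is radon", gases, delimiter="/"))
```

With case_sensitive set to True, the same whitelist would no longer match an upper-case "RADON" in the input.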
Converts an IP address to a Geocode String address according to the admin level. Administrative divisions vary depending on the country, so the returned values may be different than expected. This is a wrapper function for the IP Address GeoTagger data enrichment module.
geotagIPAddress(String IPAddress, String adminLevel)
where:
geotagIPAddress('148.86.25.54', 'City')
geotagIPAddress('148.86.25.54', ADMIN_LEVEL_CITY)
Both examples return "New York City" as a single-assign String attribute.
Converts an IP address to a Geocode and returns its geocode field as an Object. This is a wrapper function for the IP Address GeoTagger data enrichment module that returns a single attribute as a Geocode type.
geotagIPAddressGetGeocode(String IPAddress)
where:
geotagIPAddressGetGeocode('148.86.25.54')
Returns a geocode of "40.714270 -74.005970" as a single-assign Geocode attribute.
Geotags an address, based on structured fields.
geotagStructuredAddress(String country, String region, String subregion, String city,
String postcode, Boolean returnByPopulation, String adminLevel)
where:
Note that the adminLevel parameter is the only optional parameter and therefore is the only parameter that can be omitted. All other parameters must be specified, either with a value or as null.
If the address resolves to a number of locations and returnByPopulation is true, the function will pick the location with the largest population and then return its geocode.
// Get the geocode for San Francisco in the US and return the location with the largest population.
geotagStructuredAddress('us', null, null, 'san francisco', null, true, 'Geocode')
Returns "39.76 -98.5" (the geocode of San Francisco, California).
// Get the region (state in the US) in which the Boston with the highest population is located.
geotagStructuredAddress('us', '', '', 'boston', '', true, 'Region')
Returns "Massachusetts" (because Boston, Massachusetts has the largest population of all the Boston locations in the US).
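The returnByPopulation behavior in the Boston example can be pictured with a small Python sketch. It is not how the geotagger is implemented: the candidate records, their placeholder population figures, and the name pick_by_population are hypothetical, and only the case where returnByPopulation is true is modeled.

```python
def pick_by_population(candidates, field):
    """Return the requested field of the most populous candidate location."""
    if not candidates:
        return None
    largest = max(candidates, key=lambda location: location["population"])
    return largest[field]


# Hypothetical matches for the city name "boston"; the population values
# are rough placeholders, not real census data.
boston_matches = [
    {"city": "Boston", "region": "Georgia", "population": 1_000},
    {"city": "Boston", "region": "Massachusetts", "population": 600_000},
    {"city": "Boston", "region": "Virginia", "population": 500},
]

print(pick_by_population(boston_matches, "region"))
```

Here the Massachusetts record wins because it carries the largest population value, which mirrors the reasoning given for the 'Region' example.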
Converts a valid address in a String attribute to a Geocode String address according to the admin level. Administrative divisions vary depending on the country, so the returned values may be different than expected.
This is a wrapper function for the Address GeoTagger data enrichment module. It adds a multi-assign attribute (column) to your data set that contains the Geocode address.
geotagUnstructuredAddress(String addressText, String adminLevel, String addressGrain, Boolean validateAddress)
where:
geotagUnstructuredAddress(countries, 'Country', null, true)
geotagUnstructuredAddress('New York, NY 10029', 'Region', 'SubRegion', false)
Converts a valid address in a String attribute to a Geocode and returns its geocode field as an Object. This is a wrapper function for the Address GeoTagger module.
geotagUnstructuredAddressGetGeocode(String addressText, String addressGrain, Boolean validateAddress)
where:
geotagUnstructuredAddressGetGeocode(cities, ADMIN_LEVEL_CITY, false)
geotagUnstructuredAddressGetGeocode(cities, 'City', false)
Returns all entities, of a specified type, from an input String attribute. The entities are returned as a list of Strings. This function creates a new multi-assign column in your data set for the entity results. This is a wrapper function for the Named Entity Recognition extractor module. Supports only English input text.
getEntities(String attribute, String entityType)
where:
getEntities(claims, ENTITY_TYPE_LOCATION)
getEntities(reviews, 'Person')
This is a wrapper function for the Sentiment Analysis (document level) data enrichment module.
getSentiment(String textAttribute, String languageCode)
where:
getSentiment(comments, 'en')
In the example, "comments" is a String attribute.
Extracts phrases in sentences that have a positive or negative sentiment. The function call will specify the type of phrases to be extracted and which type of sentiment. A list of the desired terms (as Strings) is returned.
getTermSentiment(String textAttribute, String termAttribute, String sentimentCategory, String languageCode)
where:
getTermSentiment(comments, 'KeyPhrases', 'Positive')
getTermSentiment(comments, KEY_PHRASES, SENTIMENT_POSITIVE)
getTermSentiment(companies, 'Organization', 'Negative')
getTermSentiment(companies, ENTITY_TYPE_ORGANIZATION, SENTIMENT_NEGATIVE)
getTermSentiment(reviews, 'NounGroups', 'Positive')
getTermSentiment(reviews, NOUN_GROUPS, SENTIMENT_POSITIVE)
Returns the geocode address for a specified administrative division. Searches for the administrative division within the specified radius from the entered Geocode. This is a wrapper function for the Reverse GeoTagger data enrichment module that returns a single value.
reverseGeotag(Geocode geoAttribute, String adminLevel, Double proximityThreshold)
where:
reverseGeotag(toGeocode(42.35843, -71.05977), 'CITY', 50)
Runs a custom Groovy script defined in the external file named by pluginName and returns the result of the script.
runExternalPlugin(String pluginName, String attribute, Map options)
where:
For information on creating the external plug-in, see Extending the transform function library.
Removes any HTML, XML and XHTML markup tags from the input String and returns the result as a String. This is a wrapper function for the Tag Stripper data enrichment module.
stripTagsFromHTML(String attribute)
where:
The function returns plain text.
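As a rough picture of what tag stripping involves, the following Python sketch collects only the text content of a markup string using the standard library's html.parser. It is an assumption-laden stand-in rather than the Tag Stripper module itself; the names TextCollector and strip_tags are hypothetical, and how the real module treats entities or script content is not covered by this document.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects character data and ignores every tag it encounters."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(markup):
    """Return the text of markup with all tags removed."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()  # flush any text still buffered by the parser
    return "".join(collector.parts)


print(strip_tags("<p>Hello <b>world</b>!</p>"))
```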
Produces a String hash of the input text (English only) that represents the phonetics of the text.
A word's phonetic hash is based on its pronunciation, rather than its spelling. One application for phonetic hashes is search engines. If a search term does not return any results, the search engine can compare the term's phonetic hash to the hashes of other terms and return results for the term that is the best fit. For example, "purple" and "pruple" have the same phonetic hash (PRPL), so a search for the misspelled term "pruple" would still yield results for "purple".
toPhoneticHash(String attribute)
where:
toPhoneticHash(terms)
In the example, "terms" is a String attribute.
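The "purple" and "pruple" example can be reproduced with a deliberately simplified Python sketch that keeps the first letter, drops later vowels, and collapses doubled letters. This is not the algorithm behind toPhoneticHash, which this document does not specify; the name simple_phonetic_hash and its rules are assumptions chosen only to show why transposed letters can hash to the same value.

```python
def simple_phonetic_hash(word):
    """Build a consonant skeleton: keep the first letter, skip later vowels,
    and skip a letter that repeats the one immediately before it."""
    letters = [ch for ch in word.upper() if ch.isalpha()]
    if not letters:
        return ""
    result = [letters[0]]
    previous = letters[0]
    for ch in letters[1:]:
        if ch not in "AEIOU" and ch != previous:
            result.append(ch)
        previous = ch
    return "".join(result)


print(simple_phonetic_hash("purple"), simple_phonetic_hash("pruple"))
```

Both calls yield PRPL in this sketch, so a lookup keyed on the hash would treat the misspelling and the correct spelling as the same term.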