Enrichment functions are based on Data Enrichment modules used as part of data processing in Big Data Discovery. You can use these functions to extract meaningful information from your data and modify attributes to make them more useful for analysis.
The same functions are described in the Transform API Reference (Groovydoc).
More information on the Data Enrichment modules is available in the Data Processing Guide.
detectLanguage
Finds the language of a given String attribute and returns an Oracle language code (for example, es for Spanish). For accurate results, the text should contain at least ten words.
detectLanguage(String attribute)
where:
attribute is the String attribute on which to perform language detection.
The results are returned in a single-assign attribute.
For example, detectLanguage(labor_description) might return "en" for the labor_description String attribute.
extractKeyPhrases
Extracts key phrases from a String attribute and returns a list of phrases in a multi-assign attribute. The function calculates key phrases using the TF/IDF algorithm, which takes the total number of times each term appears within the String and offsets that value by the number of times it appears within a larger body of work. Offsetting the value helps filter out frequently used terms like "the" and "it". The body of work used as the control is selected internally based on the String's language; for example, the model used for English is based on a New York Times corpus. The extractKeyPhrases function is a wrapper function for the TF/IDF Term extractor enrichment module.
The number of key phrases returned by extractKeyPhrases is a function of the TF/IDF curve. By default, it stops returning terms when the score of a given term falls below approximately 68%.
extractKeyPhrases(String attribute, String languageCode, Boolean smartCasing)
where:
attribute is a String attribute that is to be processed. It is recommended that you convert the text to lowercase first, especially if it is in all caps.
languageCode is an optional String parameter that specifies the language name or code (for example "en", "English", "German") to improve accuracy. Supported languages are English (UK/US), Portuguese (Brazilian), Spanish, French, German, and Italian. When specified, it forces the function to use a model specific to that language. When not specified, or when passed as null (this is the default), the function automatically detects the language model. The function throws an error if an unsupported language is specified.
smartCasing is an optional parameter that, when set to true, specifies that the function automatically handles documents that are predominantly in either title case or upper case. If this parameter is not used, it defaults to true.
Examples:
extractKeyPhrases(toLowerCase(comments))
extractKeyPhrases(surveys, 'en', true)
When you create a new attribute as a result of using this function, make sure the attribute is of type multi-assign.
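The TF/IDF scoring described above can be illustrated with a minimal Python sketch. This is not the enrichment module's implementation: the tiny corpus, the smoothed IDF formula, the function names, and the reading of the 68% cutoff as "68% of the top score" are all assumptions made for illustration.

```python
import math
from collections import Counter

def tfidf_scores(text, corpus):
    """Score each term in `text`: term frequency offset by how common it is in `corpus`."""
    terms = text.lower().split()
    tf = Counter(terms)
    n_docs = len(corpus)
    scores = {}
    for term, count in tf.items():
        # Number of control documents that contain the term.
        df = sum(1 for doc in corpus if term in doc.lower().split())
        # Smoothed inverse document frequency (an assumed formula):
        # terms that appear in many control documents get a low weight.
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        scores[term] = (count / len(terms)) * idf
    return scores

def key_terms(text, corpus, cutoff=0.68):
    """Return terms scoring at least `cutoff` times the best score (assumed cutoff rule)."""
    scores = tfidf_scores(text, corpus)
    top = max(scores.values())
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, score in ranked if score >= cutoff * top]

# Hypothetical control corpus; the real module selects its own per language.
corpus = [
    "the engine is loud and hot",
    "the car is new and it is fast",
    "it is the same battery",
]
result = key_terms("the battery and the battery cable", corpus)
print(result)  # ['battery', 'cable']
```

In this sketch, "the" appears twice in the input but also in every control document, so its score falls below the cutoff, while the rarer "battery" and "cable" are kept.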
extractNounGroups
Returns a String containing noun groups. A noun group is any noun, such as "movie" or "building". This is a wrapper function for the Noun Group Extractor enrichment module. This module finds and returns noun groups from a String attribute in each of the supported languages. It is used in tag cloud visualization, for finding commonly occurring themes in the data.
extractNounGroups(String attribute, String languageCode)
where:
attribute is the String attribute to be processed.
languageCode is an optional parameter that specifies the language name or code (for example "en", "English", "German") to improve accuracy. Supported languages are English (UK/US), Spanish, French, German, and Italian. When specified, it forces the function to use a model specific to that language. When not specified, or when passed as null (this is the default), the function automatically detects the language model. The function throws an error if an unsupported language is specified.
For example, extractNounGroups(labor_description, 'en') would return nouns (such as "Battery" or "Mass Air Flow Sensor") for the labor_description String attribute.
When you create a new attribute as a result of using this function, make sure the attribute is of type multi-assign.
extractWhiteListTags
Uses a dictionary-matching algorithm that locates elements of a finite set of Strings (the whitelist) within input text. The function finds all occurrences of any whitelist terms and returns a list of matching expansions. The whitelist is a newline-delimited list of entries. This is a wrapper function for the Whitelist Tagger enrichment module.
Each line may be either a comment (indicated with a # as the first character), or a matching directive consisting of either one or two values (separated by the delimiter character). The second value is used to rewrite the match output.
For example, with a forward slash (/) as the delimiter, a whitelist entry can be written as radon/Rn. When a whitelist containing this entry is run on the text "The only noble gas is radon", it produces an output list of ['Rn'].
extractWhiteListTags(String attribute, String whitelist, String languageCode, boolean caseSensitive, boolean matchWholeWords, String delimiter)
where:
attribute is the String attribute to process.
whitelist is a document containing whitelisted entries. This should be a plain text file containing a newline-delimited list of literals and configuration terms.
languageCode is an optional String parameter that specifies the String's language to improve accuracy. Set to English by default. Only whitespace-delimited languages are supported.
caseSensitive is an optional Boolean parameter that indicates whether matching is case-sensitive (the default is false).
matchWholeWords is an optional Boolean parameter that indicates whether to match whole words only (when set to true, which is the default), or parts of words (when set to false). Whole-word matching ensures that "red" does not match "reduce".
delimiter is an optional String parameter that specifies the delimiter character used to parse a whitelist entry into the "match" and "output" values. The TAB character (\t) is the default. Note that each whitelist entry can use only one delimiter (all delimiters after the first one are ignored).
The delimiter character separates a whitelist entry into its match and output values, as in these examples:
delimiter=','
Rn,86
Ne,10
He,2
delimiter='/'
Rn/86
Ne/10
He/2
no delimiter specified (uses the default <tab> character)
Rn<tab>86
Ne<tab>10
He<tab>2
In the following example, the first statement defines the whitelist, the second statement defines a document, and the third statement uses the extractWhiteListTags Transform enrichment function. The function first matches the input text against the specified whitelist, then finds and extracts all occurrences of any terms listed in the whitelist (in English) as WhitelistTags, and finally returns a list of matching expansions:
// Whitelist entries in match/output form, one entry per line.
def whitelist = '''
helium/He
neon/Ne
argon/Ar
krypton/Kr
xenon/Xe
radon/Rn
'''
// Input text to be matched against the whitelist.
def document = 'The noble gases make a group of chemical elements with similar properties: under standard conditions, they are all odorless, colorless, monatomic gases with very low chemical reactivity. The six noble gases that occur naturally are helium (He), neon (Ne), argon (Ar), krypton (Kr), xenon (Xe), and the radioactive RADON (Rn).'
extractWhitelistTags(document, whitelist, 'en', false, true, '/')
Note that the language specified is English (en), the matches are not case-sensitive (false), and only whole words are matched (true). The '/' delimiter is used for parsing the whitelist.
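The whitelist parsing and matching rules described in this section can be sketched in Python as follows. This is an illustrative approximation, not the Whitelist Tagger itself: the function names are hypothetical, the sketch returns one output per matching entry rather than one per occurrence, and it assumes that an entry without a delimiter outputs its own match text.

```python
import re

def parse_whitelist(whitelist, delimiter="\t"):
    """Parse newline-delimited entries into a {match: output} mapping."""
    entries = {}
    for line in whitelist.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        # Split on the first delimiter only (assumption: the remainder is the output).
        match, sep, output = line.partition(delimiter)
        entries[match] = output if sep else match
    return entries

def extract_tags(text, whitelist, case_sensitive=False,
                 match_whole_words=True, delimiter="\t"):
    """Return the output value of every whitelist entry found in `text`."""
    flags = 0 if case_sensitive else re.IGNORECASE
    tags = []
    for match, output in parse_whitelist(whitelist, delimiter).items():
        pattern = re.escape(match)
        if match_whole_words:
            # Word boundaries keep "red" from matching inside "reduce".
            pattern = r"\b" + pattern + r"\b"
        if re.search(pattern, text, flags):
            tags.append(output)
    return tags

whitelist = """
# noble gases
helium/He
neon/Ne
radon/Rn
"""
tags = extract_tags("The only noble gas is RADON", whitelist, delimiter="/")
print(tags)  # ['Rn']
```

Because matching is case-insensitive by default in this sketch, the upper-case "RADON" still matches the "radon" entry and is rewritten to its output value.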
geotagIPAddress
Converts an IP address to a Geocode String address according to the admin level. Administrative divisions vary depending on the country, so the returned values may be different than expected. This is a wrapper function for the IP Address GeoTagger data enrichment module.
geotagIPAddress(String IPAddress, String adminLevel)
where:
IPAddress is a valid IP address to process, as a String.
adminLevel is an optional String parameter that specifies an administrative division to return. This can be set to only one of the following constant or literal values (case-sensitive):
ADMIN_LEVEL_CITY or 'City' for a city match.
ADMIN_LEVEL_COUNTRY or 'Country' for a country match.
ADMIN_LEVEL_REGION or 'Region' for a region match, such as a state in the United States.
ADMIN_LEVEL_REGIONID or 'RegionID' for the ID of the region in the GeoNames database, such as "6254926" for Massachusetts.
ADMIN_LEVEL_SUBREGION or 'SubRegion' for a sub-region match, such as a county in the United States.
ADMIN_LEVEL_SUBREGIONID or 'SubRegionID' for the ID of the sub-region in the GeoNames database, such as "4943909" for Middlesex County in Massachusetts.
ADMIN_LEVEL_POSTCODE or 'Postcode' for a postal code, such as a zip code in the US.
The data type of the returned value depends on the adminLevel value.
Examples:
geotagIPAddress('148.86.25.54', 'City')
geotagIPAddress('148.86.25.54', ADMIN_LEVEL_CITY)
Both examples return "New York City" as a single-assign string attribute.
geotagIPAddressGetGeocode
Converts an IP address to a Geocode and returns its geocode field as an Object. This is a wrapper function for the IP Address GeoTagger data enrichment module that returns a single attribute as a Geocode type.
geotagIPAddressGetGeocode(String IPAddress)
where:
IPAddress is a valid IP address to process, as a String.
For example, geotagIPAddressGetGeocode('148.86.25.54') returns a geocode of "40.714270 -74.005970" as a single-assign Geocode attribute.
geotagStructuredAddress
Geotags an address, based on structured fields.
geotagStructuredAddress(String country, String region, String subregion, String city, String postcode, Boolean returnByPopulation, String adminLevel)
where:
country is the field for the country of the address (use null when unknown).
region is the field for the region of the address (use null when unknown). A region would be a state in the US.
subregion is the field for the sub-region of the address (use null when unknown). A sub-region would be a county in the US.
city is the field for the city of the address (use null when unknown).
postcode is the field for the postal code of the address (use null when unknown). A postal code would be a zip code in the US.
returnByPopulation is an optional Boolean parameter that, if set to true, returns the location with the largest population. The default is false.
adminLevel is an optional String parameter that specifies a specific field to return. This can be set to only one of the following constant or literal values (case-sensitive):
ADMIN_LEVEL_CITY or 'City' returns the city of the address.
ADMIN_LEVEL_COUNTRY or 'Country' returns the country of the address.
ADMIN_LEVEL_REGION or 'Region' returns the region of the address, such as a state in the United States.
ADMIN_LEVEL_REGIONID or 'RegionID' returns the ID of the region in the GeoNames database, such as "6254926" for Massachusetts.
ADMIN_LEVEL_SUBREGION or 'SubRegion' returns the sub-region of the address, such as a county in the United States.
ADMIN_LEVEL_SUBREGIONID or 'SubRegionID' returns the ID of the sub-region in the GeoNames database, such as "4952349" for Suffolk County in Massachusetts.
ADMIN_LEVEL_POSTCODE or 'Postcode' returns the postal code of the address.
ADMIN_LEVEL_GEOCODE or 'Geocode' returns the geocode of the least hierarchical administrative level. This is the default.
Note that the adminLevel parameter is the only optional parameter and therefore is the only parameter that can be omitted. All other parameters must be specified, either with a value or as null.
The data type of the returned value depends on the adminLevel parameter.
If the address resolves to a number of locations and returnByPopulation is true, the function picks the location with the largest population and then returns its geocode.
Examples:
// Get the geocode for San Francisco in the US and return the location with the largest population.
geotagStructuredAddress('us', null, null, 'san francisco', null, true, 'Geocode')
Returns "39.76 -98.5" (the geocode of San Francisco, California).
// Get the region (state in the US) in which the Boston with the highest population is located.
geotagStructuredAddress('us', '', '', 'boston', '', true, 'Region')
Returns "Massachusetts" (because Boston, Massachusetts has the largest population of all the Boston locations in the US).
geotagUnstructuredAddress
Converts a valid address in a String attribute to a Geocode String address according to the admin level. Administrative divisions vary depending on the country, so the returned values may be different than expected.
This is a wrapper function for the Address GeoTagger data enrichment module. It adds a multi-assign attribute (column) to your data set that contains the Geocode address.
geotagUnstructuredAddress(String addressText, String adminLevel, String addressGrain, Boolean validateAddress)
where:
addressText is the address String to process. This must be less than or equal to 350 characters.
adminLevel is an optional String parameter that specifies an administrative division to return. This can be set to only one of the following constant or literal values (case-sensitive):
ADMIN_LEVEL_CITY or 'City' for a city match.
ADMIN_LEVEL_COUNTRY or 'Country' for a country match.
ADMIN_LEVEL_REGION or 'Region' for a region match, such as a state in the United States.
ADMIN_LEVEL_REGIONID or 'RegionID' for the ID of the region in the GeoNames database, such as "6254926" for Massachusetts.
ADMIN_LEVEL_SUBREGION or 'SubRegion' for a sub-region match, such as a county in the United States.
ADMIN_LEVEL_SUBREGIONID or 'SubRegionID' for the ID of the sub-region in the GeoNames database, such as "4943909" for Middlesex County in Massachusetts.
ADMIN_LEVEL_POSTCODE or 'Postcode' for a postal code, such as a zip code in the US.
addressGrain is an optional String parameter that specifies an administrative division to help the GeoTagger find the most likely match for a given level. This can be set to only one of the following constant or literal values (case-sensitive):
ADMIN_LEVEL_CITY or 'City' for a city match.
ADMIN_LEVEL_COUNTRY or 'Country' for a country match.
ADMIN_LEVEL_REGION or 'Region' for a region match.
ADMIN_LEVEL_SUBREGION or 'SubRegion' for a sub-region match.
ADMIN_LEVEL_NONE or 'None' returns the most populous location that most closely matches the address String. This is the default value.
validateAddress is an optional Boolean parameter that specifies whether the GeoTagger should validate the address.
The data type of the returned value depends on the adminLevel value.
Examples:
geotagUnstructuredAddress(countries, 'Country', 'None', true)
geotagUnstructuredAddress('New York, NY 10029', 'Region', 'SubRegion', false)
geotagUnstructuredAddressGetGeocode
Converts an address to a Geocode and returns its geocode field as an Object. This is a wrapper function for the Address GeoTagger module.
geotagUnstructuredAddressGetGeocode(String addressText, String addressGrain, Boolean validateAddress)
where:
addressText is the address String to process. This must be less than or equal to 350 characters.
addressGrain is an optional String parameter that helps the GeoTagger to find the most likely match for a given level. This can be set to only one of the following values (case-insensitive):
ADMIN_LEVEL_CITY or 'City' for a city match.
ADMIN_LEVEL_COUNTRY or 'Country' for a country match.
ADMIN_LEVEL_REGION or 'Region' for a region match, such as a state in the United States.
ADMIN_LEVEL_SUBREGION or 'SubRegion' for a sub-region match, such as a county in the United States.
ADMIN_LEVEL_NONE or 'None' returns the most populous location that most closely matches the address String. This is the default value.
validateAddress is an optional Boolean parameter that specifies whether the GeoTagger should validate the address.
Examples:
geotagUnstructuredAddressGetGeocode(cities, ADMIN_LEVEL_CITY, false)
geotagUnstructuredAddressGetGeocode(cities, 'City', false)
getEntities
Returns all entities, of a specified type, from an input String attribute. The entities are returned as a list of Strings. This function creates a new multi-assign column in your data set for the entity results. This is a wrapper function for the Named Entity Recognition extractor module. Supports only English input text.
getEntities(String attribute, String entityType)
where:
attribute specifies the String attribute that is to be processed.
entityType is a String parameter that specifies the type of entity to extract. You can specify only one of the following constant or literal values (case-sensitive):
ENTITY_TYPE_PERSON or 'Person' returns all the person entities found in the attribute.
ENTITY_TYPE_ORGANIZATION or 'Organization' returns all the organization entities found in the attribute.
ENTITY_TYPE_LOCATION or 'Location' returns all the location entities found in the attribute. Location entities are names of places, such as "Boston" or "Canada".
Examples:
getEntities(claims, ENTITY_TYPE_LOCATION)
getEntities(reviews, 'Person')
getSentiment
Returns the overall sentiment of a String attribute as one of the following values:
POSITIVE
NEGATIVE
This is a wrapper function for the Sentiment Analysis (document level) data enrichment module.
getSentiment(String textAttribute, String languageCode)
where:
textAttribute specifies the String attribute that is to be processed.
languageCode is an optional parameter that specifies the language name or code (for example "en", "English", "German") to improve accuracy. Supported languages are English (UK/US), Spanish, French, German, and Italian. When specified, it forces the function to use a model specific to that language. When not specified, or when passed as null (this is the default), the function automatically detects the language model. The function throws an error if an unsupported language is specified.
Example:
getSentiment(comments, 'English')
In the example, "comments" is a String attribute.
getTermSentiment
Extracts phrases in sentences that have a positive or negative sentiment. The function call will specify the type of phrases to be extracted and which type of sentiment. A list of the desired terms (as Strings) is returned.
getTermSentiment(String textAttribute, String termAttribute, String sentimentCategory, String languageCode)
where:
textAttribute is the String attribute to process.
termAttribute is a String parameter that specifies the type of terms to extract, based on their sentiment (as set by the sentimentCategory argument). You can specify only one of the following values (case-sensitive) for the term type:
ENTITY_TYPE_PERSON or 'Person' locates passages that contain person entities and returns the sentiment of those passages.
ENTITY_TYPE_ORGANIZATION or 'Organization' locates passages that contain organization entities and returns the sentiment of those passages.
ENTITY_TYPE_LOCATION or 'Location' locates passages that contain location entities and returns the sentiment of those passages.
NOUN_GROUPS or 'NounGroups' extracts noun groups in sentences, based on their specified sentiment.
KEY_PHRASES or 'KeyPhrases' extracts key phrases in sentences, based on their specified sentiment.
sentimentCategory specifies the type of sentiment to be considered for the terms. You can specify SENTIMENT_POSITIVE (or 'Positive') for positive sentiment or SENTIMENT_NEGATIVE (or 'Negative') for negative sentiment. All values are case-sensitive.
languageCode is an optional parameter that specifies the language name or code (for example "en", "English", "German") to improve accuracy. Supported languages are English (UK/US), Spanish, French, German, and Italian. When specified, it forces the function to use a model specific to that language. When not specified, or when passed as null (this is the default), the function automatically detects the language model. The function throws an error if an unsupported language is specified.
Examples:
getTermSentiment(comments, 'KeyPhrases', 'Positive')
getTermSentiment(comments, KEY_PHRASES, SENTIMENT_POSITIVE)
getTermSentiment(companies, 'Organization', 'Negative')
getTermSentiment(companies, ENTITY_TYPE_ORGANIZATION, SENTIMENT_NEGATIVE)
getTermSentiment(reviews, 'NounGroups', 'Positive')
getTermSentiment(reviews, NOUN_GROUPS, SENTIMENT_POSITIVE)
reverseGeotag
Returns the geocode address for a specified administrative division. Searches for the administrative division within the specified radius from the entered Geocode. This is a wrapper function for the Reverse GeoTagger data enrichment module that returns a single value.
reverseGeotag(Geocode geoAttribute, String adminLevel, Double proximityThreshold)
where:
geoAttribute is the Geocode to process.
adminLevel is a String parameter that specifies an administrative division to return. This can be set to only one of these constant or literal values (case-sensitive):
ADMIN_LEVEL_CITY or 'City' for a city match.
ADMIN_LEVEL_COUNTRY or 'Country' for a country match.
ADMIN_LEVEL_REGION or 'Region' for a region match, such as a state in the United States.
ADMIN_LEVEL_REGIONID or 'RegionID' for the ID of the region in the GeoNames database, such as "6254926" for Massachusetts.
ADMIN_LEVEL_SUBREGION or 'SubRegion' for a sub-region match, such as a county in the United States.
ADMIN_LEVEL_SUBREGIONID or 'SubRegionID' for the ID of the sub-region in the GeoNames database, such as "4943909" for Middlesex County in Massachusetts.
ADMIN_LEVEL_POSTCODE or 'Postcode' for a postal code, such as a zip code in the US.
proximityThreshold is an optional Double parameter that specifies the maximum distance, in miles, allowed between the input geocode and the output geographic location. If this parameter is not specified, the default of 100 miles is used. If the distance exceeds the threshold, null is returned.
The data type of the returned value depends on the adminLevel parameter.
The following example returns the city field:
reverseGeotag(toGeocode(42.35843, -71.05977), 'City', 50)
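The proximityThreshold rule (return null when the matched location is farther than the threshold from the input geocode) can be sketched in Python. How the Reverse GeoTagger actually measures distance is not stated here; the haversine great-circle formula, the Earth radius constant, and the function names below are assumptions used only for illustration. The two coordinates are the Boston and New York City geocodes that appear in the examples in this document.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius; an assumed constant

def distance_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, using the haversine formula."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

def within_threshold(name, input_geocode, place_geocode, proximity_threshold=100.0):
    """Return `name` if the place lies within the threshold, otherwise None."""
    d = distance_miles(*input_geocode, *place_geocode)
    return name if d <= proximity_threshold else None

boston = (42.35843, -71.05977)
new_york = (40.714270, -74.005970)

d = distance_miles(*boston, *new_york)
too_far = within_threshold("New York City", boston, new_york)           # default 100 miles
close_enough = within_threshold("New York City", boston, new_york, 250.0)
print(round(d), too_far, close_enough)
```

With the default 100-mile threshold, the New York City candidate is rejected because it is roughly 190 miles from the Boston geocode; raising the threshold to 250 miles accepts it.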
runExternalPlugin
Runs a custom, external Groovy script that is defined in the external file named by pluginName, and returns the result of the script.
runExternalPlugin(String pluginName, String attribute, Map options)
where:
pluginName is the name (base name and extension) of the Groovy script file (for example, MyPlugin.groovy).
attribute is the input String passed to the script.
options is an options Map, which contains any options to be used by the Groovy script. By default, the Map is empty.
For information on creating the external plug-in, see Extending the transform function library.
stripTagsFromHTML
Removes any HTML, XML and XHTML markup tags from the input String and returns the result as a String. This is a wrapper function for the Tag Stripper data enrichment module.
stripTagsFromHTML(String attribute)
where:
attribute is the HTML String to process.
The function returns plain text.
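The effect of tag stripping can be illustrated with a short Python sketch built on the standard library's html.parser module. This is not the Tag Stripper module: the class and function names are hypothetical, and details such as how entities, scripts, or whitespace are handled by the real module are not covered by this document.

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Collects only the text content and discards all markup tags."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into plain characters.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def strip_tags(markup):
    """Return the text of `markup` with all tags removed."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()  # flush any buffered text
    return "".join(collector.chunks)

plain = strip_tags("<p>Replaced the <b>battery</b> &amp; cables.</p>")
print(plain)  # Replaced the battery & cables.
```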
toPhoneticHash
Produces a String hash of the input text (English only) that represents the phonetics of the text.
A word's phonetic hash is based on its pronunciation, rather than its spelling. One application for phonetic hashes is search engines. If a search term does not return any results, the search engine can compare the term's phonetic hash to the hashes of other terms and return results for the term that is the best fit. For example, "purple" and "pruple" have the same phonetic hash (PRPL), so a search for the misspelled term "pruple" would still yield results for "purple".
toPhoneticHash(String attribute)
where:
attribute is the String attribute to process.
Example:
toPhoneticHash(terms)
In the example, "terms" is a String attribute.
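The idea that a misspelling can share a hash with the intended word can be illustrated with a deliberately simplified Python sketch. The algorithm that toPhoneticHash uses is not specified in this document; the consonant-skeleton rule below (uppercase the word and drop vowels) is a stand-in chosen only because it reproduces the "purple"/"pruple" example, and the function name is hypothetical.

```python
VOWELS = set("aeiou")

def consonant_skeleton(word):
    """Simplified stand-in for a phonetic hash: keep consonants only, in upper case."""
    kept = [ch for ch in word.lower() if ch.isalpha() and ch not in VOWELS]
    return "".join(kept).upper()

correct = consonant_skeleton("purple")
misspelled = consonant_skeleton("pruple")
print(correct, misspelled)  # PRPL PRPL
```

Because both spellings reduce to the same key, a search layer could look up "pruple" by its hash and still find entries indexed under "purple". A real phonetic algorithm also accounts for letter combinations that sound alike, which this sketch does not attempt.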