Enrichment functions

Enrichment functions are based on Data Enrichment modules used as part of data processing in Big Data Discovery. You can use these functions to extract meaningful information from your data and modify attributes to make them more useful for analysis.

The same functions are described in the Transform API Reference (Groovydoc).

More information on the Data Enrichment modules is available in the Data Processing Guide.

Transform supports the following enrichment functions:

detectLanguage

Finds the language of a given String attribute and returns an Oracle language code (for example, es for Spanish). For accurate results, the text should contain at least ten words.

The syntax is:

detectLanguage(String attribute)

where:

attribute is the String attribute on which to perform language detection.

The results are returned in a single-assign attribute.

Example:

detectLanguage(labor_description)

might return "en" for the labor_description String attribute.

extractKeyPhrases

Extracts key phrases from a String attribute and returns a list of phrases in a multi-assign attribute. The function calculates key phrases using the TF/IDF algorithm, which takes the total number of times each term appears within the String and offsets that value by the number of times it appears within a larger body of work. Offsetting the value helps filter out frequently used terms like "the" and "it". The body of work used as the control is selected internally based on the String's language; for example, the model used for English is based on a New York Times corpus. The extractKeyPhrases function is a wrapper function for the TF/IDF Term extractor enrichment module.

The number of key phrases returned by extractKeyPhrases is a function of the TF/IDF curve. By default, it stops returning terms when the score of a given term falls below ~68%.
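
The TF/IDF idea described above can be sketched in a few lines of Python. This is only an illustration of the general technique, not the implementation used by the enrichment module: the function names, the whitespace tokenizer, the smoothed IDF formula, and the tiny background corpus are all assumptions made for the example.

```python
import math
from collections import Counter

def tfidf_scores(text, background_docs):
    """Score each term in `text`: term frequency, offset by how common
    the term is across `background_docs` (an assumed control corpus)."""
    terms = text.lower().split()  # naive whitespace tokenizer (assumption)
    tf = Counter(terms)
    doc_sets = [set(d.lower().split()) for d in background_docs]
    n_docs = len(doc_sets)
    scores = {}
    for term, count in tf.items():
        # Number of background documents that contain the term.
        df = sum(1 for s in doc_sets if term in s)
        # Smoothed inverse document frequency (one common variant).
        idf = math.log((1 + n_docs) / (1 + df)) + 1
        scores[term] = (count / len(terms)) * idf
    return scores

def ranked_terms(text, background_docs):
    """Return the terms of `text`, highest TF/IDF score first."""
    scores = tfidf_scores(text, background_docs)
    return sorted(scores, key=lambda t: -scores[t])

background = [
    "the cat sat on the mat",
    "the dog ate the bone",
    "it was the best of times",
]
text = "the battery failed and the battery was replaced"
print(ranked_terms(text, background)[0])  # battery
```

In this sketch "the" and "battery" occur equally often in the input, but "the" also appears in every background document, so its score is pushed below that of "battery".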

The syntax is:

extractKeyPhrases(String attribute, String languageCode, Boolean smartCasing)

where:

attribute is a String attribute that is to be processed. It is recommended that you convert the text to lowercase first, especially if it is in all caps.
languageCode is an optional String parameter that specifies the language name or code (for example "en", "English", "German") to improve accuracy. Supported languages are English (UK/US), Portuguese (Brazilian), Spanish, French, German, and Italian. When specified, it forces the function to use a model specific to that language. When not specified, or when passed as null (this is the default), the function will automatically detect the language model. Throws an error if a non-supported language is specified.
smartCasing is an optional parameter that, when set to true, specifies that the function automatically handle documents that are predominantly in either title case or upper case. If this parameter is not used, it defaults to true.

Examples:

extractKeyPhrases(toLowerCase(comments))

extractKeyPhrases(surveys, 'en', true)

When you create a new attribute as a result of using this function, make sure the attribute is of type multi-assign.

extractNounGroups

Returns a String containing noun groups. A noun group is any noun, such as "movie" or "building". This is a wrapper function for the Noun Group Extractor enrichment module. This module finds and returns noun groups from a String attribute in each of the supported languages. It is used in tag cloud visualization, for finding commonly occurring themes in the data.

The syntax is:

extractNounGroups(String attribute, String languageCode)

where:

attribute is the String attribute to be processed.
languageCode is an optional parameter that specifies the language name or code (for example "en", "English", "German") to improve accuracy. Supported languages are English (UK/US), Spanish, French, German, and Italian. When specified, it forces the function to use a model specific to that language. When not specified, or when passed as null (this is the default), the function will automatically detect the language model. Throws an error if a non-supported language is specified.

Example:

extractNounGroups(labor_description, 'en')

would return nouns (such as "Battery" or "Mass Air Flow Sensor") for the labor_description String attribute.

When you create a new attribute as a result of using this function, make sure the attribute is of type multi-assign.

extractWhiteListTags

Uses a dictionary-matching algorithm that locates elements of a finite set of Strings (the whitelist) within input text. The function finds all occurrences of any whitelist terms and returns a list of matching expansions. The whitelist is a newline-delimited list of entries. This is a wrapper function for the Whitelist Tagger enrichment module.

Each line of the whitelist is either a comment (indicated with a # as the first character) or a matching directive consisting of one or two values (separated by the delimiter character). The first value is the term to match; the second value, if present, is used to rewrite the match output.

Here is a simple example whitelist:

helium
neon
argon
krypton
xenon
radon

It could be rewritten as follows, with a forward slash (/) as the delimiter:

helium/He
neon/Ne
argon/Ar
krypton/Kr
xenon/Xe
radon/Rn

When this whitelist is run on the text "The only noble gas is radon", it produces the output list ['Rn'].

The syntax is:

extractWhiteListTags(String attribute, String whitelist, String languageCode,
      boolean caseSensitive, boolean matchWholeWords, String delimiter)

where:

attribute is the String attribute to process.
whitelist is a document containing whitelisted entries. This should be a plain text file containing a newline-delimited list of literals and configuration terms.
languageCode is an optional String parameter that specifies the String's language to improve accuracy. Set to English by default. Supported languages are whitespace-delimited languages only.
caseSensitive is an optional Boolean parameter that indicates whether input is case-sensitive (the default is false).
matchWholeWords is an optional Boolean parameter that indicates whether to match whole words only (when set to true), or also parts of words (when set to false, which is the default). Setting it to true ensures that "red" does not match "reduce".
delimiter is an optional String parameter that specifies the delimiter character used to parse a whitelist entry into the "match" and "output" values. The TAB character (\t) is the default. Note that each whitelist entry can use only one delimiter (all delimiters after the first one are ignored).

The delimiter character separates a whitelist entry into its match and output values, as in these examples:

delimiter=','
Rn,86
Ne,10
He,2

delimiter='/'
Rn/86
Ne/10
He/2

no delimiter specified (uses the default <tab> character)
Rn<tab>86
Ne<tab>10
He<tab>2

In this extractWhiteListTags example, the first statement defines a whitelist named whitelist, the second statement defines a document, and the third statement calls the extractWhiteListTags enrichment function. The function matches the input text against the specified whitelist, finds all occurrences of any terms listed in the whitelist (in English), and returns a list of matching expansions:

def whitelist = '''
helium/He
neon/Ne
argon/Ar
krypton/Kr
xenon/Xe
radon/Rn
'''
// Triple quotes allow the String to span several lines.
def document = '''The noble gases make a group of chemical elements with similar properties: 
under standard conditions, they are all odorless, colorless, monatomic gases with very low 
chemical reactivity. The six noble gases that occur naturally are helium (He), neon (Ne), 
argon (Ar), krypton (Kr), xenon (Xe), and the radioactive RADON (Rn).'''

extractWhiteListTags(document, whitelist, 'en', false, true, '/')

Note that the language specified is English ('en'), the matches are not case-sensitive (false), and matching is restricted to whole words (true). The '/' delimiter is used for parsing the whitelist.
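
The matching rules described in this section (comment lines, the optional output value, case sensitivity, whole-word matching, and the delimiter) can be illustrated with a short Python sketch. It is an approximation written for this explanation, not the Whitelist Tagger itself: the function names are hypothetical, a regular-expression scan stands in for the module's dictionary-matching algorithm, and returning matches in order of their position in the text is an assumption.

```python
import re

def parse_whitelist(whitelist, delimiter="\t"):
    """Parse newline-delimited entries into (match, output) pairs."""
    entries = []
    for line in whitelist.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        # Split on the first delimiter only.
        parts = line.split(delimiter, 1)
        match = parts[0]
        output = parts[1] if len(parts) > 1 else match
        entries.append((match, output))
    return entries

def extract_tags(text, whitelist, case_sensitive=False,
                 match_whole_words=False, delimiter="\t"):
    """Return the output value of every whitelist term found in `text`."""
    flags = 0 if case_sensitive else re.IGNORECASE
    hits = []
    for match, output in parse_whitelist(whitelist, delimiter):
        pattern = re.escape(match)
        if match_whole_words:
            # Word boundaries keep "red" from matching inside "reduce".
            pattern = r"\b" + pattern + r"\b"
        for m in re.finditer(pattern, text, flags):
            hits.append((m.start(), output))
    # Ordering by position in the text is an assumption of this sketch.
    return [output for _, output in sorted(hits)]

gases = "# noble gases\nhelium/He\nneon/Ne\nradon/Rn"
print(extract_tags("The only noble gas is radon", gases, delimiter="/"))
# ['Rn']
```

With match_whole_words=True, a whitelist containing only "red" produces one tag for the text "reduce the red light"; with the default of False it produces two.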

geotagIPAddress

Converts an IP address to a Geocode String address according to the admin level. Administrative divisions vary depending on the country, so the returned values may be different than expected. This is a wrapper function for the IP Address GeoTagger data enrichment module.

The syntax is:

geotagIPAddress(String IPAddress, String adminLevel)

where:

IPAddress is a valid IP address to process, in type String.
adminLevel is an optional String parameter that specifies an administrative division to return. This can be set to only one of the following constant or literal values (case-sensitive):
- ADMIN_LEVEL_CITY or 'City' for a city match.
- ADMIN_LEVEL_COUNTRY or 'Country' for a country match.
- ADMIN_LEVEL_REGION or 'Region' for a region match, such as a state in the United States.
- ADMIN_LEVEL_REGIONID or 'RegionID' for the ID of the region in the GeoNames database, such as "6254926" for Massachusetts.
- ADMIN_LEVEL_SUBREGION or 'SubRegion' for a sub-region match, such as a county in the United States.
- ADMIN_LEVEL_SUBREGIONID or 'SubRegionID' for the ID of the sub-region in the GeoNames database, such as "4943909" for Middlesex County in Massachusetts.
- ADMIN_LEVEL_POSTCODE or 'Postcode' for a postal code, such as a zip code in the US.

The returned data types for the adminLevel are:

String for adminLevel = City, Country, Postcode, Region, SubRegion, RegionID, SubRegionID
Geocode for adminLevel = Geocode

Examples:

geotagIPAddress('148.86.25.54', 'City')

geotagIPAddress('148.86.25.54', ADMIN_LEVEL_CITY)

Both examples return "New York City" as a single-assign string attribute.

geotagIPAddressGetGeocode

Converts an IP address to a Geocode and returns its geocode field as an Object. This is a wrapper function for the IP Address GeoTagger data enrichment module that returns a single attribute as a Geocode type.

The syntax is:

geotagIPAddressGetGeocode(String IPAddress)

where:

IPAddress is a valid IP address to process, in type String.

Example:

geotagIPAddressGetGeocode('148.86.25.54')

Returns a geocode of "40.714270 -74.005970" as a single-assign Geocode attribute.

geotagStructuredAddress

Geotags an address, based on structured fields.

The syntax is:

geotagStructuredAddress(String country, String region, String subregion, String city, 
      String postcode, Boolean returnByPopulation, String adminLevel)

where:

country is the field for the country of the address (use null when unknown).
region is the field for the region of the address (use null when unknown). A region would be a state in the US.
subregion is the field for the sub-region of the address (use null when unknown). A sub-region would be a county in the US.
city is the field for the city of the address (use null when unknown).
postcode is the field for the postal code of the address (use null when unknown). A postal code would be a zip code in the US.
returnByPopulation is a Boolean parameter that, if set to true, returns the location with the largest population.
adminLevel is an optional String parameter that specifies a specific field to return. This can be set to only one of the following constant or literal values (case-sensitive):
- ADMIN_LEVEL_CITY or 'City' returns the city of the address.
- ADMIN_LEVEL_COUNTRY or 'Country' returns the country of the address.
- ADMIN_LEVEL_REGION or 'Region' returns the region of the address, such as a state in the United States.
- ADMIN_LEVEL_REGIONID or 'RegionID' returns the ID of the region in the GeoNames database, such as "6254926" for Massachusetts.
- ADMIN_LEVEL_SUBREGION or 'SubRegion' returns the sub-region of the address, such as a county in the United States.
- ADMIN_LEVEL_SUBREGIONID or 'SubRegionID' returns the ID of the sub-region in the GeoNames database, such as "4952349" for Suffolk County in Massachusetts.
- ADMIN_LEVEL_POSTCODE or 'Postcode' returns the postal code of the address.
- ADMIN_LEVEL_GEOCODE or 'Geocode' returns the geocode of the least hierarchical administrative level. This is the default.

Note that the adminLevel parameter is the only optional parameter and therefore is the only parameter that can be omitted. All other parameters must be specified, either with a value or as null.

The function returns the value requested by the adminLevel parameter. The returned data types for the adminLevel are:

String for adminLevel = City, Country, Postcode, Region, SubRegion, RegionID, SubRegionID
Geocode for adminLevel = Geocode

If the address resolves to a number of locations and returnByPopulation is true, the function will pick the location with the largest population and then return its geocode.
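
The tie-breaking rule for returnByPopulation amounts to taking a maximum over the candidate locations. The Python sketch below shows only that selection step; the candidate records, their field names, and the population figures are illustrative placeholders rather than GeoNames data, and the sketch does not model what happens when returnByPopulation is false.

```python
def most_populous(candidates):
    """Return the candidate location with the largest population."""
    return max(candidates, key=lambda c: c["population"])

# Hypothetical candidates for the city name "boston"; the populations
# are illustrative placeholders, not GeoNames values.
candidates = [
    {"region": "Georgia", "population": 1_000},
    {"region": "Massachusetts", "population": 600_000},
]

chosen = most_populous(candidates)
print(chosen["region"])  # Massachusetts
```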

Example 1:

// Get the geocode for San Francisco in the US and return the location with the largest population.
geotagStructuredAddress( 'us', null, null, 'san francisco', null, true, 'Geocode')

Returns "37.77493 -122.41942" (the geocode of San Francisco, California).

Example 2:

// Get the region (state in the US) in which the Boston with the highest population is located.
geotagStructuredAddress('us', '', '', 'boston', '', true, 'Region')

Returns "Massachusetts" (because Boston, Massachusetts has the largest population of all the Boston locations in the US).

geotagUnstructuredAddress

Converts a valid address in a String attribute to a Geocode String address according to the admin level. Administrative divisions vary depending on the country, so the returned values may be different than expected.

This is a wrapper function for the Address GeoTagger data enrichment module. It adds a multi-assign attribute (column) to your data set that contains the Geocode address.

The syntax is:

geotagUnstructuredAddress(String addressText, String adminLevel, String addressGrain, Boolean validateAddress)

where:

addressText is the address String to process. This must be less than or equal to 350 characters.
adminLevel is an optional String parameter that specifies an administrative division to return. This can be set to only one of the following constant or literal values (case-sensitive):
- ADMIN_LEVEL_CITY or 'City' for a city match.
- ADMIN_LEVEL_COUNTRY or 'Country' for a country match.
- ADMIN_LEVEL_REGION or 'Region' for a region match, such as a state in the United States.
- ADMIN_LEVEL_REGIONID or 'RegionID' for the ID of the region in the GeoNames database, such as "6254926" for Massachusetts.
- ADMIN_LEVEL_SUBREGION or 'SubRegion' for a sub-region match, such as a county in the United States.
- ADMIN_LEVEL_SUBREGIONID or 'SubRegionID' for the ID of the sub-region in the GeoNames database, such as "4943909" for Middlesex County in Massachusetts.
- ADMIN_LEVEL_POSTCODE or 'Postcode' for a postal code, such as a zip code in the US.
addressGrain is an optional String parameter that specifies an administrative division to help the GeoTagger find the most likely match for a given level. This can be set to only one of the following constant or literal values (case-sensitive):
- ADMIN_LEVEL_CITY or 'City' for a city match.
- ADMIN_LEVEL_COUNTRY or 'Country' for a country match.
- ADMIN_LEVEL_REGION or 'Region' for a region match.
- ADMIN_LEVEL_SUBREGION or 'SubRegion' for a sub-region match.
- ADMIN_LEVEL_NONE or 'None' returns the most populous location that most closely matches the address String. This is the default value.
validateAddress is an optional Boolean parameter that specifies whether the GeoTagger should validate the address.

The returned data types for the adminLevel are:

String for adminLevel = City, Country, Postcode, Region, SubRegion, RegionID, SubRegionID
Geocode for adminLevel = Geocode

The first example below retrieves Geocode addresses for country names in the "countries" String attribute; the second resolves a literal address String to its region:

geotagUnstructuredAddress(countries, 'Country', 'None', true)
geotagUnstructuredAddress('New York, NY 10029', 'Region', 'SubRegion', false)

geotagUnstructuredAddressGetGeocode

Converts an address String to a Geocode and returns its geocode field as an Object. This is a wrapper function for the Address GeoTagger module.

The syntax is:

geotagUnstructuredAddressGetGeocode(String addressText, String addressGrain, Boolean validateAddress)

where:

addressText is the address String to process. This must be less than or equal to 350 characters.
addressGrain is an optional String parameter that helps the GeoTagger to find the most likely match for a given level. This can be set to only one of the following constant or literal values (case-sensitive):
- ADMIN_LEVEL_CITY or 'City' for a city match.
- ADMIN_LEVEL_COUNTRY or 'Country' for a country match.
- ADMIN_LEVEL_REGION or 'Region' for a region match, such as a state in the United States.
- ADMIN_LEVEL_SUBREGION or 'SubRegion' for a sub-region match, such as a county in the United States.
- ADMIN_LEVEL_NONE or 'None' returns the most populous location that most closely matches the address String. This is the default value.
validateAddress is an optional Boolean parameter that specifies whether the GeoTagger should validate the address.

Examples:

geotagUnstructuredAddressGetGeocode(cities, ADMIN_LEVEL_CITY, false)

geotagUnstructuredAddressGetGeocode(cities, 'City', false)

getEntities

Returns all entities, of a specified type, from an input String attribute. The entities are returned as a list of Strings. This function creates a new multi-assign column in your data set for the entity results. This is a wrapper function for the Named Entity Recognition extractor module. Supports only English input text.

The syntax is:

getEntities(String attribute, String entityType)

where:

attribute specifies the String attribute that is to be processed.
entityType is a String parameter that specifies the type of entity to extract. You can specify only one of the following constant or literal values (case-sensitive):
- ENTITY_TYPE_PERSON or 'Person' returns all the person entities found in the attribute.
- ENTITY_TYPE_ORGANIZATION or 'Organization' returns all the organization entities found in the attribute.
- ENTITY_TYPE_LOCATION or 'Location' returns all the location entities found in the attribute. Location entities are names of places, such as "Boston" or "Canada".

Examples:

getEntities(claims, ENTITY_TYPE_LOCATION)

getEntities(reviews, 'Person')

getSentiment

Returns a String containing the overall sentiment of a String attribute. The attribute's sentiment can be one of the following:

POSITIVE
NEGATIVE

This is a wrapper function for the Sentiment Analysis (document level) data enrichment module.

The syntax is:

getSentiment(String textAttribute, String languageCode)

where:

textAttribute specifies the String attribute that is to be processed.
languageCode is an optional parameter that specifies the language name or code (for example "en", "English", "German") to improve accuracy. Supported languages are English (UK/US), Spanish, French, German, and Italian. When specified, it forces the function to use a model specific to that language. When not specified, or when passed as null (this is the default), the function will automatically detect the language model. Throws an error if a non-supported language is specified.

Example:

getSentiment(comments, 'English')

In the example, "comments" is a String attribute.

getTermSentiment

Extracts phrases in sentences that have a positive or negative sentiment. The function call will specify the type of phrases to be extracted and which type of sentiment. A list of the desired terms (as Strings) is returned.

The syntax is:

getTermSentiment(String textAttribute, String termAttribute, String sentimentCategory, String languageCode)

where:

textAttribute is the String attribute to process.
termAttribute is a String parameter that specifies the type of terms to extract, based on their sentiment (as set by the sentimentCategory argument). You can specify only one of the following values (case-sensitive) for the term type:
- ENTITY_TYPE_PERSON or 'Person' locates passages that contain person entities and returns the sentiment of those passages.
- ENTITY_TYPE_ORGANIZATION or 'Organization' locates passages that contain organization entities and returns the sentiment of those passages.
- ENTITY_TYPE_LOCATION or 'Location' locates passages that contain location entities and returns the sentiment of those passages.
- NOUN_GROUPS or 'NounGroups' extracts noun groups in sentences, based on their specified sentiment.
- KEY_PHRASES or 'KeyPhrases' extracts key phrases in sentences, based on their specified sentiment.
sentimentCategory is a String parameter that specifies the type of sentiment to be considered for the terms. You can specify SENTIMENT_POSITIVE (or 'Positive') for positive sentiment or SENTIMENT_NEGATIVE (or 'Negative') for negative sentiment. All values are case-sensitive.
languageCode is an optional parameter that specifies the language name or code (for example "en", "English", "German") to improve accuracy. Supported languages are English (UK/US), Spanish, French, German, and Italian. When specified, it forces the function to use a model specific to that language. When not specified, or when passed as null (this is the default), the function will automatically detect the language model. Throws an error if a non-supported language is specified.

Examples:

getTermSentiment(comments, 'KeyPhrases', 'Positive')
getTermSentiment(comments, KEY_PHRASES, SENTIMENT_POSITIVE)

getTermSentiment(companies, 'Organization', 'Negative')
getTermSentiment(companies, ENTITY_TYPE_ORGANIZATION, SENTIMENT_NEGATIVE)

getTermSentiment(reviews, 'NounGroups', 'Positive')
getTermSentiment(reviews, NOUN_GROUPS, SENTIMENT_POSITIVE)

reverseGeotag

Returns the geocode address for a specified administrative division. Searches for the administrative division within the specified radius from the entered Geocode. This is a wrapper function for the Reverse GeoTagger data enrichment module that returns a single value.

The syntax is:

reverseGeotag(Geocode geoAttribute, String adminLevel, Double proximityThreshold)

where:

geoAttribute is the Geocode to process.
adminLevel is a String parameter that specifies an administrative division to return. This can be set to only one of these constant or literal values (case-sensitive):
- ADMIN_LEVEL_CITY or 'City' for a city match.
- ADMIN_LEVEL_COUNTRY or 'Country' for a country match.
- ADMIN_LEVEL_REGION or 'Region' for a region match, such as a state in the United States.
- ADMIN_LEVEL_REGIONID or 'RegionID' for the ID of the region in the GeoNames database, such as "6254926" for Massachusetts.
- ADMIN_LEVEL_SUBREGION or 'SubRegion' for a sub-region match, such as a county in the United States.
- ADMIN_LEVEL_SUBREGIONID or 'SubRegionID' for the ID of the sub-region in the GeoNames database, such as "4943909" for Middlesex County in Massachusetts.
- ADMIN_LEVEL_POSTCODE or 'Postcode' for a postal code, such as a zip code in the US.
proximityThreshold is an optional Double parameter that specifies the maximum distance, in miles, allowed between the input geocode and the output geographic location. If this parameter is not specified, the default of 100 miles is used. If the distance exceeds the threshold, null is returned.

The function returns the geotagged address as requested by the adminLevel parameter. The returned data types for the adminLevel are:

String for adminLevel = City, Country, Postcode, Region, SubRegion, RegionID, SubRegionID
Geocode for adminLevel = Geocode

The following example uses two values to create a Geocode object, then returns the city for that Geocode, searching within a 50-mile radius:

reverseGeotag(toGeocode(42.35843, -71.05977), 'City', 50)
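
The effect of proximityThreshold can be pictured with a small Python sketch: find the closest known place to the input geocode and return null (None) when it lies farther away than the threshold. This is a sketch under stated assumptions, not the Reverse GeoTagger's method: the use of great-circle (haversine) distance, the nearest-place rule, the function names, and the two-entry place list are all assumptions made for illustration.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

def nearest_within(geocode, places, proximity_threshold=100.0):
    """Return the name of the closest place, or None if it is farther
    away than `proximity_threshold` miles."""
    lat, lon = geocode
    best = min(places, key=lambda p: haversine_miles(lat, lon, p["lat"], p["lon"]))
    distance = haversine_miles(lat, lon, best["lat"], best["lon"])
    return best["name"] if distance <= proximity_threshold else None

# Tiny stand-in for a gazetteer, using geocodes that appear in this section.
places = [
    {"name": "Boston", "lat": 42.35843, "lon": -71.05977},
    {"name": "New York City", "lat": 40.714270, "lon": -74.005970},
]

print(nearest_within((42.36, -71.06), places, 50))  # Boston
print(nearest_within((0.0, 0.0), places))           # None
```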

runExternalPlugin

Runs a custom, external Groovy script, defined in the external file named by pluginName, and returns the result of the script.

The syntax is:

runExternalPlugin(String pluginName, String attribute, Map options)

where:

pluginName is the name (base name and extension) of the Groovy script file (for example, MyPlugin.groovy).
attribute is the input String passed to the script.
options is a Map that contains any options to be used by the Groovy script. By default, the Map is empty.

For information on creating the external plug-in, see Extending the transform function library.

stripTagsFromHTML

Removes any HTML, XML and XHTML markup tags from the input String and returns the result as a String. This is a wrapper function for the Tag Stripper data enrichment module.

The syntax is:

stripTagsFromHTML(String attribute)

where:

attribute is the HTML String to process.

The function returns plain text.
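
As an illustration of what tag stripping involves, the following Python sketch removes markup with the standard library's html.parser module and keeps only the text content. It is not the Tag Stripper module: the class and function names are hypothetical, and details such as how the real module treats script or style content, whitespace, or malformed markup are not covered here.

```python
from html.parser import HTMLParser

class _TextCollector(HTMLParser):
    """Collects the text content of a document and ignores every tag."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; in text.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def strip_tags(markup):
    """Return `markup` with its markup tags removed."""
    collector = _TextCollector()
    collector.feed(markup)
    collector.close()  # flush any buffered trailing text
    return "".join(collector.chunks)

print(strip_tags("<p>Hello <b>world</b>!</p>"))  # Hello world!
```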

toPhoneticHash

Produces a String hash of the input text (English only) that represents the phonetics of the text.

A word's phonetic hash is based on its pronunciation, rather than its spelling. One application for phonetic hashes is search engines. If a search term does not return any results, the search engine can compare the term's phonetic hash to the hashes of other terms and return results for the term that is the best fit. For example, "purple" and "pruple" have the same phonetic hash (PRPL), so a search for the misspelled term "pruple" would still yield results for "purple".

The syntax is:

toPhoneticHash(String attribute)

where:

attribute is the String attribute to process.

Example:

toPhoneticHash(terms)

In the example, "terms" is a String attribute.
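
To make the idea of a phonetic hash concrete, here is a deliberately crude Python stand-in that reduces a word to its consonant skeleton. It is not the algorithm behind toPhoneticHash, which this section does not specify; the function name and the rules (uppercase, drop vowels, collapse adjacent repeats) are assumptions chosen only to show how two spellings can collapse to the same hash.

```python
VOWELS = set("AEIOU")

def consonant_skeleton(word):
    """Crude phonetic-style hash: uppercase the letters, drop vowels,
    and collapse adjacent repeated consonants."""
    result = []
    for ch in word.upper():
        if not ch.isalpha() or ch in VOWELS:
            continue  # ignore vowels and non-letters
        if result and result[-1] == ch:
            continue  # collapse adjacent duplicates
        result.append(ch)
    return "".join(result)

print(consonant_skeleton("purple"))  # PRPL
print(consonant_skeleton("pruple"))  # PRPL
```

Because the misspelling only moves a vowel, both words reduce to the same skeleton, which is the property a search application relies on when it falls back to phonetic matching.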