Enrichment functions

Enrichment functions are based on Data Enrichment modules used as part of data processing in Big Data Discovery. You can use these functions to extract meaningful information from your data and modify attributes to make them more useful for analysis.

The same functions are described in the Transform API Reference (Groovydoc).

More information on the Data Enrichment modules is available in the Data Processing Guide.

detectLanguage

Finds the language of a given document and returns an Oracle language code (for example, es for Spanish). For accurate results, the text should contain at least ten words.

detectLanguage accepts the following parameter:
  • text. The String on which to perform language detection.

extractKeyPhrases

Extracts key phrases from a String and returns a list of phrases. The function calculates key phrases using the TF/IDF algorithm, which takes the total number of times each term appears within the String and offsets that value by the number of times the term appears within a larger body of work. Offsetting the value helps filter out frequently used terms like "the" and "it". The body of work used as the control is selected internally based on the String's language; for example, the model used for English is based on a New York Times corpus. The extractKeyPhrases function is a wrapper function for the TF/IDF Term extractor enrichment module.

The number of key phrases returned by extractKeyPhrases is a function of the TF/IDF curve. By default, it stops returning terms when the score of a given term falls below ~68%.

extractKeyPhrases accepts the following parameters:
  • text. The text in type String that is to be processed. It is recommended that you convert the text to lowercase first, especially if it is in all caps.
  • language. An optional parameter that specifies the language name or code (for example "en", "English", "German") to improve accuracy. Supported languages are English (UK/US), Portuguese (Brazilian), Spanish, French, German, and Italian. When specified, it forces the function to use a model specific to that language. When not specified, or when passed as null (the default), the language is detected automatically.
Note: When you create a new attribute as a result of using this function, make sure the attribute is of type multi-assign.
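
The TF/IDF weighting described above can be illustrated with a minimal Python sketch. This is not the Term extractor module's implementation. The tiny reference corpus, the smoothed formula, the single-word terms, and the names tf_idf_scores and key_terms are all assumptions made for illustration. The 0.68 cut-off is only loosely modeled on the default threshold mentioned above and is interpreted here as a fraction of the top score.

```python
import math
from collections import Counter

# Hypothetical reference corpus standing in for the per-language model.
REFERENCE_DOCS = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "it was the best of times",
    "the market rallied and it closed higher",
]


def tf_idf_scores(text, reference_docs=REFERENCE_DOCS):
    """Score each term: its count in text, offset by how common it is in the reference docs."""
    terms = text.lower().split()
    tf = Counter(terms)
    n_docs = len(reference_docs)
    doc_sets = [set(doc.split()) for doc in reference_docs]
    scores = {}
    for term, count in tf.items():
        df = sum(1 for doc in doc_sets if term in doc)
        # Smoothed inverse document frequency (an assumed variant of the formula).
        idf = math.log((1 + n_docs) / (1 + df)) + 1
        scores[term] = count * idf
    return scores


def key_terms(text, cutoff=0.68):
    """Return terms scoring at least `cutoff` times the best score, best first."""
    scores = tf_idf_scores(text)
    if not scores:
        return []
    best = max(scores.values())
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, score in ranked if score >= cutoff * best]
```

In this sketch, a term such as "the" appears in every reference document, so its weight stays low even when it occurs often in the input, while a term that is absent from the reference corpus is weighted up.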

extractNounGroups

Returns a String containing noun groups. A noun group is any noun, such as "movie" or "building". This is a wrapper function for the Noun Group Extractor enrichment module. This module finds and returns noun groups from a string attribute in each of the supported languages. It is used in tag cloud visualization, for finding commonly occurring themes in the data.

extractNounGroups accepts the following parameters:
  • text. The String to be processed.
  • language. An optional parameter that specifies the language name or code (for example "en", "English", "German") to improve accuracy. Supported languages are English (UK/US), Portuguese (Brazilian), Spanish, French, German, and Italian. When specified, it forces the function to use a model specific to that language. When not specified, or when passed as null (the default), the language is detected automatically.

extractWhiteListTags

Uses a dictionary-matching algorithm that locates elements of a finite set of strings (the whitelist) within the input text. The function finds all occurrences of any whitelist terms and returns a list of matching expansions. This is a wrapper function for the Whitelist Tagger enrichment module.

The whitelist is newline-delimited. Each line may be either a comment (indicated with a # as the first character) or a matching directive consisting of one or two values separated by a TAB. The second value, if present, is used to rewrite the match output.

Here is a simple example whitelist:
  • helium
  • neon
  • argon
  • krypton
  • xenon
  • radon

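It could be rewritten as follows, with a TAB (shown here as <TAB>) separating each term from its rewrite value:

  • helium<TAB>He
  • neon<TAB>Ne
  • argon<TAB>Ar
  • krypton<TAB>Kr
  • xenon<TAB>Xe
  • radon<TAB>Rn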

When this whitelist is run on the text "The only noble gas is radon", it produces the output list ['Rn'].

extractWhiteListTags accepts the following parameters:
  • text. The String to process.
  • whitelist. A document containing whitelisted terms. This should be a plain text file containing a newline-delimited list of literals and configuration terms.
  • language. An optional parameter that specifies the String's language to improve accuracy. Set to English by default. Supported languages are English (US/UK), Danish, German, Spanish, French, Italian, Japanese, Korean, Simplified Chinese, Traditional Chinese, and Portuguese (Brazilian).
  • caseSensitive. Indicates whether input is case-sensitive (the default is false).
  • unbounded. Indicates whether to match whole words only (when set to false, which is the default) or parts of words as well (when set to true). With the default setting, "red" does not match "reduce".
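
The whitelist format and the caseSensitive and unbounded options can be sketched in Python as follows. This is an illustration only: it uses regular expressions rather than the dictionary-matching algorithm of the Whitelist Tagger module, it ignores the language parameter, and the names parse_whitelist and extract_tags are hypothetical.

```python
import re


def parse_whitelist(whitelist_text):
    """Map each whitelist term to its output value, skipping comments and blank lines."""
    mapping = {}
    for line in whitelist_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        parts = line.split("\t")
        term = parts[0].strip()
        # The optional second value rewrites the match output.
        output = parts[1].strip() if len(parts) > 1 and parts[1].strip() else term
        mapping[term] = output
    return mapping


def extract_tags(text, whitelist_text, case_sensitive=False, unbounded=False):
    """Return the output values of whitelist terms found in text, in order of occurrence."""
    mapping = parse_whitelist(whitelist_text)
    flags = 0 if case_sensitive else re.IGNORECASE
    hits = []
    for term, output in mapping.items():
        pattern = re.escape(term)
        if not unbounded:
            pattern = r"\b" + pattern + r"\b"  # whole words only
        for match in re.finditer(pattern, text, flags):
            hits.append((match.start(), output))
    return [output for _, output in sorted(hits)]


# The example whitelist from this section, with a TAB between term and rewrite value.
NOBLE_GASES = "# noble gases\nhelium\tHe\nneon\tNe\nargon\tAr\nkrypton\tKr\nxenon\tXe\nradon\tRn\n"
```

Calling extract_tags("The only noble gas is radon", NOBLE_GASES) in this sketch yields the single rewritten value for radon, matching the example above.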

geotagAddress*

A set of the following functions:
  • geotagAddressGetCity
  • geotagAddressGetCountry
  • geotagAddressGetGeocode
  • geotagAddressGetPostcode
  • geotagAddressGetRegion
  • geotagAddressGetSubRegion
  • geotagAddressGetRegionID
  • geotagAddressGetSubRegionID
Converts a valid address String to a Geocode object and returns one of its fields, such as the city, country, geocode, postcode, region, subregion, or the region or subregion ID. These are wrapper functions for the Address Geotagger data enrichment module. The module adds a multi-assign attribute (column) to your data set that contains the following fields:
  • city
  • country
  • geocode (the address's latitude and longitude coordinates)
  • latitude
  • longitude
  • population
  • postal_code
  • region
  • sub_region
  • Geoname ID for the region or sub_region
geotagAddress* accepts the following parameters:
  • address. The address String to process. This must be no longer than 350 characters.
  • Map. This is a map of advanced options:
    • PREFERRED_LEVEL. An optional parameter in type String that specifies an administrative division to improve accuracy. This can be set to only one of the following values (case-insensitive):
      • CITY. Target for a city match.
      • COUNTRY. Target for a country match.
      • REGION. Target for a region match, such as "state" in the United States.
      • SUB_REGION. Target for a subregion match, such as "county".
      • NONE. If this value is used, the function returns the most populous location that most closely matches the address String. This is the default value.
      Note: Administrative divisions vary by country, so the returned values may differ from what you expect. Also, if the value you supply is not one of those listed above, an exception is thrown.
    • STRICT_MODE. An optional Boolean parameter that specifies how the function should handle ambiguous or improperly-formatted addresses, such as one that contains an incorrect postal code. This can be set to one of the following:
      • true. If the address is invalid, the function returns null.
      • false. If the address is invalid, the function returns the closest match. This is the default.
The following example shows how to pass these parameters in a map to the geotagAddressGetSubRegion function:
geotagAddressGetSubRegion('1 Main Street Cambridge', ['PREFERRED_LEVEL':'CITY', 'STRICT_MODE':true])
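
The defaults and validation rules for the options map can be summarized in a short Python sketch. It does not call the Address Geotagger module. The function name normalize_geotag_options is hypothetical, and ValueError is an assumption, since the text above states only that an exception is thrown.

```python
ALLOWED_LEVELS = {"CITY", "COUNTRY", "REGION", "SUB_REGION", "NONE"}


def normalize_geotag_options(options=None):
    """Apply the documented defaults and validate PREFERRED_LEVEL case-insensitively."""
    options = dict(options or {})
    # PREFERRED_LEVEL defaults to NONE and is compared without regard to case.
    level = str(options.get("PREFERRED_LEVEL", "NONE")).upper()
    if level not in ALLOWED_LEVELS:
        # Assumed exception type; the documentation does not name one.
        raise ValueError(f"Unsupported PREFERRED_LEVEL: {level}")
    # STRICT_MODE defaults to false (return the closest match for invalid addresses).
    strict = bool(options.get("STRICT_MODE", False))
    return {"PREFERRED_LEVEL": level, "STRICT_MODE": strict}
```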

geotagIPAddressGetCity

Converts an IP address to a Geocode and returns its city field as an Object. This is a wrapper function for the IP Address Geotagger data enrichment module that returns a single value.

geotagIPAddressGetCity accepts the following parameters:
  • IPAddress. The IP address to process, in type String.
  • language. An optional String parameter that specifies the output language. The default value is null, which sets the language to English.

geotagIPAddressGetCountry

Converts an IP address to a Geocode and returns its country field as an Object. This is a wrapper function for the IP Address Geotagger data enrichment module that returns a single entity type.

geotagIPAddressGetCountry accepts the following parameters:
  • IPAddress. The IP address to process, in type String.
  • language. An optional String parameter that specifies the output language. The default value is null, which sets the language to English.

geotagIPAddressGetGeocode

Converts an IP address to a Geocode and returns its geocode field as an Object. This is a wrapper function for the IP Address Geotagger data enrichment module that returns a single entity type.

geotagIPAddressGetGeocode accepts the following parameters:
  • IPAddress. The IP address to process, in type String.
  • language. An optional String parameter that specifies the output language. The default value is null, which sets the language to English.

geotagIPAddressGetPostCode

Converts an IP address to a Geocode and returns its postal_code field as an Object. This is a wrapper function for the IP Address Geotagger data enrichment module that returns a single entity type.

geotagIPAddressGetPostCode accepts the following parameters:
  • IPAddress. The IP address to process, in type String.
  • language. An optional String parameter that specifies the output language. The default value is null, which sets the language to English.

geotagIPAddressGetRegion

Converts an IP address to a Geocode and returns its region field as an Object. This is a wrapper function for the IP Address Geotagger data enrichment module that returns a single entity type.

geotagIPAddressGetRegion accepts the following parameters:
  • IPAddress. The IP address to process, in type String.
  • language. An optional String parameter that specifies the output language. The default value is null, which sets the language to English.

geotagIPAddressGetRegionID

Converts an IP address to a Geocode and returns its Geoname ID for the region field as an Object. This is a wrapper function for the IP Address Geotagger data enrichment module that returns a single entity type.

geotagIPAddressGetRegionID accepts the following parameters:
  • IPAddress. The IP address to process, in type String.
  • language. An optional String parameter that specifies the output language. The default value is null, which sets the language to English.

geotagIPAddressGetSubRegion

Converts an IP address to a Geocode and returns its sub_region field as an Object. This is a wrapper function for the IP Address Geotagger data enrichment module that returns a single entity type.

geotagIPAddressGetSubRegion accepts the following parameters:
  • IPAddress. The IP address to process, in type String.
  • language. An optional String parameter that specifies the output language. The default value is null, which sets the language to English.

geotagIPAddressGetSubRegionID

Converts an IP address to a Geocode and returns its Geoname ID for the sub_region field as an Object. This is a wrapper function for the IP Address Geotagger data enrichment module that returns a single entity type.

geotagIPAddressGetSubRegionID accepts the following parameters:
  • IPAddress. The IP address to process, in type String.
  • language. An optional String parameter that specifies the output language. The default value is null, which sets the language to English.

getLocationEntities

Returns all location entities within a String as an Object. Location entities are names of places, such as "Boston" or "Canada". This function creates a new multi-assign column in your data set. This is a wrapper function for the Name Entity extractor data enrichment module that returns a single entity type.

getLocationEntities accepts the following parameter:
  • text. The String to process.

getNegativeLocationEntitySentiment

Locates passages within a String that contain location entities and returns the negative sentiment of those passages as an Object.

getNegativeLocationEntitySentiment accepts the following parameters:
  • text. The String to process.
  • language. An optional parameter that specifies the language in type String to improve accuracy. If set to null (which is the default value), the language is automatically detected. Supported language is English only.

getNegativeNounGroupsSentiment

Locates passages within a String that contain noun groups and returns the negative sentiment of those passages as an Object.

getNegativeNounGroupsSentiment accepts the following parameters:
  • text. The String to process.
  • language. An optional parameter that specifies the language in type String to improve accuracy. If set to null (which is the default value), the language is automatically detected. Supported languages are English (UK/US), Portuguese (Brazilian), Spanish, French, German and Italian.

getNegativeOrganizationEntitySentiment

Locates passages within a String that contain organization entities and returns the negative sentiment of those passages as an Object.

getNegativeOrganizationEntitySentiment accepts the following parameters:
  • arg1. The String to process.
  • language. An optional parameter that specifies the String's language to improve accuracy. If set to null (which is the default value), the language is automatically detected. Supported language is English only.

getNegativePersonEntitySentiment

Locates passages within a String that contain person entities and returns the negative sentiment of those passages as an Object.

getNegativePersonEntitySentiment accepts the following parameters:
  • arg1. The String to process.
  • language. An optional parameter that specifies the String's language to improve accuracy. If set to null (which is the default value), the language is automatically detected. Supported language is English only.

getNegativeTFIDFSentiment

Extracts key phrases in sentences that have a negative sentiment.

getNegativeTFIDFSentiment accepts the following parameters:
  • arg1. The String to process.
  • language. An optional parameter that specifies the String's language to improve accuracy. If set to null (which is the default value), the language is automatically detected. Supported languages are English (UK/US), Portuguese (Brazilian), Spanish, French, German and Italian.

getOrganizationEntities

Returns an Object containing the organization entities found within a String. This is a wrapper function for the Name Entity extractor data enrichment module that returns a single entity type.

Note: This function creates a new multi-assign column in your data set.
getOrganizationEntities accepts the following parameter:
  • arg1. The String to process.

getPersonEntities

Returns an Object containing the person entities found within a String. This is a wrapper function for the Name Entity extractor data enrichment module that returns a single entity type.

Note: This function creates a new multi-assign column in your data set.
getPersonEntities accepts the following parameter:
  • arg1. The String to process.

getPositiveLocationEntitySentiment

Locates passages within a String that contain location entities and returns the positive sentiment of those passages as an Object.

getPositiveLocationEntitySentiment accepts the following parameters:
  • arg1. The String to process.
  • language. An optional parameter that specifies the String's language to improve accuracy. If set to null (which is the default value), the language is automatically detected. Supported language is English only.

getPositiveNounGroupsSentiment

Locates passages within a String that contain noun groups and returns the positive sentiment of those passages as an Object.

getPositiveNounGroupsSentiment accepts the following parameters:
  • arg1. The String to process.
  • language. An optional parameter that specifies the String's language to improve accuracy. If set to null (which is the default value), the language is automatically detected. Supported language is English only.

getPositivePersonEntitySentiment

Locates passages within a String that contain person entities and returns the positive sentiment of those passages as an Object.

getPositivePersonEntitySentiment accepts the following parameters:
  • arg1. The String to process.
  • language. An optional parameter that specifies the String's language to improve accuracy. If set to null (which is the default value), the language is automatically detected. Supported language is English only.

getPositiveOrganizationEntitySentiment

Locates passages within a String that contain organization entities and returns the positive sentiment of those passages as an Object.

getPositiveOrganizationEntitySentiment accepts the following parameters:
  • arg1. The String to process.
  • language. An optional parameter that specifies the String's language to improve accuracy. If set to null (which is the default value), the language is automatically detected. Supported language is English only.

getPositiveTFIDFSentiment

Extracts key phrases in sentences that have a positive sentiment.

getPositiveTFIDFSentiment accepts the following parameters:
  • arg1. The String to process.
  • language. An optional parameter that specifies the String's language to improve accuracy. If set to null (which is the default value), the language is automatically detected. Supported languages are English (UK/US), Portuguese (Brazilian), Spanish, French, German, and Italian.

getSentiment

Returns an Object containing the overall sentiment of a String. This is a wrapper function for the Sentiment Analysis (document level) data enrichment module. The String's sentiment can be one of the following:
  • POSITIVE
  • NEGATIVE
getSentiment accepts the following parameters:
  • arg1. The String to process.
  • language. An optional parameter that specifies the String's language to improve accuracy. Supported languages are English (UK/US), Portuguese (Brazilian), Spanish, French, German, and Italian. If set to null (which is the default value), the language is automatically detected.

reverseGeotagGetCity

Returns the city field from a Geocode as an Object. Searches for cities within the specified radius from the entered Geocode. This is a wrapper function for the Reverse Geotagger data enrichment module that returns a single value.

reverseGeotagGetCity accepts the following parameters:
  • geo. The Geocode to process.
  • language. An optional parameter that specifies the output language. The default value is null, which sets the output language to English.
  • proximityThreshold. An optional parameter that specifies the maximum distance, in miles, allowed between the input geocode and the output geographic location. If this parameter is not specified, the default of 100 miles is used. If the distance exceeds the threshold, null is returned.
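
The effect of proximityThreshold can be sketched in Python as follows. This is not the Reverse Geotagger module's implementation. The use of great-circle (haversine) distance, the two-entry lookup table, and the names haversine_miles and nearest_within are assumptions made for illustration.

```python
import math

EARTH_RADIUS_MILES = 3958.8


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two latitude/longitude points, in miles."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def nearest_within(geo, candidates, proximity_threshold=100.0):
    """Return the closest candidate's name, or None if it lies beyond the threshold."""
    if not candidates:
        return None
    name, distance = min(
        ((n, haversine_miles(geo[0], geo[1], lat, lon)) for n, (lat, lon) in candidates.items()),
        key=lambda item: item[1],
    )
    return name if distance <= proximity_threshold else None


# Hypothetical lookup table; the real module uses its own geographic data.
CITIES = {"Boston": (42.3601, -71.0589), "New York": (40.7128, -74.0060)}
```

In this sketch, a geocode in Cambridge, Massachusetts resolves to Boston under the default 100-mile threshold, while a geocode far from every candidate returns None.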

reverseGeotagGetCountry

Returns the country field from a Geocode as an Object. Searches for countries within the specified radius from the entered Geocode. This is a wrapper function for the Reverse Geotagger data enrichment module that returns a single value.

reverseGeotagGetCountry accepts the following parameters:
  • geo. The Geocode to process.
  • language. An optional parameter that specifies the output language. The default value is null, which sets the output language to English.
  • proximityThreshold. An optional parameter that specifies the maximum distance, in miles, allowed between the input geocode and the output geographic location. If this parameter is not specified, the default of 100 miles is used. If the distance exceeds the threshold, null is returned.

reverseGeotagGetPostCode

Returns the postal_code field from a Geocode as an Object. Searches for post codes within the specified radius from the entered Geocode. This is a wrapper function for the Reverse Geotagger data enrichment module that returns a single value.

reverseGeotagGetPostCode accepts the following parameters:
  • geo. The Geocode to process.
  • language. An optional parameter that specifies the output language. The default value is null, which sets the output language to English.
  • proximityThreshold. An optional parameter that specifies the maximum distance, in miles, allowed between the input geocode and the output geographic location. If this parameter is not specified, the default of 100 miles is used. If the distance exceeds the threshold, null is returned.

reverseGeotagGetRegion

Returns the region field from a Geocode as an Object. Searches for regions within the specified radius from the entered Geocode. This is a wrapper function for the Reverse Geotagger data enrichment module that returns a single value.

reverseGeotagGetRegion accepts the following parameters:
  • geo. The Geocode to process.
  • language. An optional parameter that specifies the output language. The default value is null, which sets the output language to English.
  • proximityThreshold. An optional parameter that specifies the maximum distance, in miles, allowed between the input geocode and the output geographic location. If this parameter is not specified, the default of 100 miles is used. If the distance exceeds the threshold, null is returned.

reverseGeotagGetRegionID

Returns the Geoname ID of the region field from a Geocode as an Object. Searches for regions within the specified radius from the entered Geocode. This is a wrapper function for the Reverse Geotagger data enrichment module that returns a single value.

reverseGeotagGetRegionID accepts the following parameters:
  • geo. The Geocode to process.
  • language. An optional parameter that specifies the output language. The default value is null, which sets the output language to English.
  • proximityThreshold. An optional parameter that specifies the maximum distance, in miles, allowed between the input geocode and the output geographic location. If this parameter is not specified, the default of 100 miles is used. If the distance exceeds the threshold, null is returned.

reverseGeotagGetSubRegion

Returns the sub_region field from a Geocode as an Object. Searches for sub-regions within the specified radius from the entered Geocode. This is a wrapper function for the Reverse Geotagger data enrichment module that returns a single value.

reverseGeotagGetSubRegion accepts the following parameters:
  • geo. The Geocode to process.
  • language. An optional parameter that specifies the output language. The default value is null, which sets the output language to English.
  • proximityThreshold. An optional parameter that specifies the maximum distance, in miles, allowed between the input geocode and the output geographic location. If this parameter is not specified, the default of 100 miles is used. If the distance exceeds the threshold, null is returned.

reverseGeotagGetSubRegionID

Returns the Geoname ID of the sub_region field from a Geocode as an Object. Searches for sub-regions within the specified radius from the entered Geocode. This is a wrapper function for the Reverse Geotagger data enrichment module that returns a single value.

reverseGeotagGetSubRegionID accepts the following parameters:
  • geo. The Geocode to process.
  • language. An optional parameter that specifies the output language. The default value is null, which sets the output language to English.
  • proximityThreshold. An optional parameter that specifies the maximum distance, in miles, allowed between the input geocode and the output geographic location. If this parameter is not specified, the default of 100 miles is used. If the distance exceeds the threshold, null is returned.

runExternalPlugin

Runs the external Groovy script defined in the external file identified by pluginName, and returns the result of the script.

runExternalPlugin accepts the following parameters:
  • pluginName. The name of the external plugin.
  • arg1. An argument passed to the external plugin.

stripTagsFromHTML

Removes any HTML, XML and XHTML markup tags from the input String and returns the result as an Object. This is a wrapper function for the Tag Stripper data enrichment module.

stripTagsFromHTML accepts the following parameter:
  • arg1. The HTML String to process.
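
The tag-stripping behavior described above can be approximated with Python's standard html.parser module. This is a sketch of the general technique, not the Tag Stripper module itself. The name strip_tags is hypothetical, and details such as the handling of script content or malformed markup may differ from the real function.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text nodes and discards every markup tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags; tags themselves are ignored.
        self.parts.append(data)


def strip_tags(markup):
    """Return the text content of an HTML, XML, or XHTML string."""
    collector = _TextCollector()
    collector.feed(markup)
    collector.close()  # flush any buffered text
    return "".join(collector.parts)
```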

toPhoneticHash

Produces a String hash of the input text (English only) that represents the phonetics of the text.

A word's phonetic hash is based on its pronunciation, rather than its spelling. One application for phonetic hashes is search engines. If a search term does not return any results, the search engine can compare the term's phonetic hash to the hashes of other terms, and return results for the term that is the best fit. For example, "purple" and "pruple" have the same phonetic hash (PRPL), so a search for the misspelled term "pruple" would still yield results for "purple".

toPhoneticHash accepts the following parameter:
  • arg1. The String to process.
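
The idea of hashing by pronunciation rather than spelling can be shown with a deliberately simplified Python sketch. The algorithm that toPhoneticHash actually uses is not described here, so the rule below (keep the first letter, drop later vowels, collapse repeated letters) is an assumed stand-in, and the name simple_phonetic_hash is hypothetical.

```python
def simple_phonetic_hash(word):
    """Uppercase the word, keep its first letter, drop later vowels, collapse repeats."""
    letters = [ch for ch in word.upper() if ch.isalpha()]
    if not letters:
        return ""
    result = [letters[0]]
    for ch in letters[1:]:
        if ch in "AEIOU":
            continue  # vowels after the first letter carry little phonetic weight here
        if ch != result[-1]:
            result.append(ch)  # skip immediately repeated consonants
    return "".join(result)
```

Under this simplified rule, "purple" and the misspelled "pruple" both hash to PRPL, as in the example above.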