About transformations and transformation scripts

Transformations are changes you make to your project data set after the source data has been processed and loaded into Studio. They can serve as a substitute for the data-cleaning step of an ETL process. A transformation can overwrite an existing attribute, modify attributes, or create new attributes.

For example, you can apply any of the following transformations:

Change an attribute's data type
Change capitalization of values
Remove attributes or records
Split columns into new ones (by creating new attributes)
Add or remove attributes, or overwrite existing attributes
Group or bin values
Extract information from values
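
To make the list above concrete, the following is a minimal conceptual sketch of the same kinds of changes applied to a couple of in-memory records. It is written in plain Python and does not use Studio's Transform page or the Big Data Discovery Groovy transform functions; the attribute names (name, price, email, notes), the bin boundaries, and the helper functions are all assumptions made up for illustration.

```python
import re

# Hypothetical sample records; attribute names are invented for illustration.
records = [
    {"name": "alice SMITH", "price": "19.99", "email": "alice@example.com", "notes": "n/a"},
    {"name": "bob jones", "price": "250", "email": "bob@example.org", "notes": ""},
]


def price_band(price):
    """Group a numeric price into a coarse bin (boundaries are arbitrary)."""
    if price < 50:
        return "low"
    if price < 200:
        return "medium"
    return "high"


def transform(record):
    """Return a transformed copy of one record; the input is left untouched."""
    out = dict(record)
    out["price"] = float(out["price"])            # change an attribute's data type
    out["name"] = out["name"].title()             # change capitalization of values
    first, _, last = out["name"].partition(" ")   # split one column into new attributes
    out["first_name"], out["last_name"] = first, last
    out["price_band"] = price_band(out["price"])  # group or bin values
    match = re.search(r"@(.+)$", out["email"])    # extract information from a value
    out["email_domain"] = match.group(1) if match else None
    del out["notes"]                              # remove an attribute
    return out


transformed = [transform(r) for r in records]
```

Each step either overwrites an existing attribute (price, name), creates a new one (first_name, last_name, price_band, email_domain), or removes one (notes), which mirrors the three kinds of effect a transformation can have.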

Most transformations are available directly as specific options in the Transform page of Studio.

You can use the Groovy scripting language, together with a list of custom, predefined Groovy-based transform functions available in Big Data Discovery, to create a transformation script. A transformation script is a collection of transformations and can contain any of the transform functions.

You can also write your own transformations from scratch in Groovy, using the Transformation Editor on the same Transform page of Studio.

When you commit a transformation script to a project, the script runs against the data sample but does not affect the data set in the Catalog. You can either apply the transformation script to your current project or use it to create a new data set:

When you commit the transformation script to the project, no new entry is created in the Catalog, but the current project does show the effects of the transformation script.
When you create a new data set using the transformation script, a new data set entry is added to the Catalog for use by other projects. That new data set is a new sample of the original source Hive table after the transformation script is applied. Creating a new data set in this way does not apply the transformation script to the current project.
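
The difference between the two options can be modeled with a small sketch. This is a toy Python model, not Studio's implementation: the Workspace class, its method names, and the sample rows are hypothetical, and sampling the source Hive table is simplified to copying every source row.

```python
def uppercase_city(record):
    """A stand-in for a transformation script: uppercases one attribute."""
    out = dict(record)
    out["city"] = out["city"].upper()
    return out


class Workspace:
    """Toy model of a Catalog plus one project that works on a data sample."""

    def __init__(self, source_rows):
        # Assumption: the "source table" is just a list of dicts.
        self.source_rows = [dict(r) for r in source_rows]
        self.catalog = {"original": [dict(r) for r in source_rows]}
        self.project_sample = [dict(r) for r in source_rows]

    def commit_to_project(self, script):
        """Apply the script to the project's sample; the Catalog is untouched."""
        self.project_sample = [script(r) for r in self.project_sample]

    def create_new_data_set(self, name, script):
        """Add a new Catalog entry built from the source; the project is untouched."""
        self.catalog[name] = [script(r) for r in self.source_rows]


rows = [{"city": "boston"}, {"city": "austin"}]

committed = Workspace(rows)
committed.commit_to_project(uppercase_city)

exported = Workspace(rows)
exported.create_new_data_set("cities_upper", uppercase_city)
```

After the commit, committed.project_sample holds the uppercased values while committed.catalog still has only the original entry. After creating a new data set, exported.catalog gains the cities_upper entry while exported.project_sample is unchanged.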