Transformations are changes you can make to your
project data set, after the source data has been processed and loaded into
Studio. Transformations can be thought of as a substitute for an ETL process of
cleaning your data. Transformations can overwrite an existing attribute, modify
attributes, or create new attributes.
For example, you can apply any of the following transformations:
- Change an attribute's data type
- Change capitalization of values
- Remove attributes or records
- Split columns into new ones (by creating new attributes)
- Add or remove attributes, or overwrite existing attributes
- Group or bin values
- Extract information from values
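
To make a few of the transformations listed above concrete, here is a minimal sketch in plain Python. It is not Studio's Transform page and does not use the Groovy-based transform functions; the records, attribute names, and the `transform` helper are all hypothetical and only illustrate the kinds of change involved.

```python
# Illustrative only: plain Python standing in for Studio transformations.
# The records and attribute names below are hypothetical.
records = [
    {"name": "alice smith", "age": "34", "city_state": "Boston, MA"},
    {"name": "bob jones", "age": "58", "city_state": "Austin, TX"},
]

def transform(record):
    out = dict(record)                    # work on a copy of the input record
    out["age"] = int(out["age"])          # change an attribute's data type
    out["name"] = out["name"].title()     # change capitalization of values
    # Split one column into two new attributes and drop the original.
    city, state = out.pop("city_state").split(", ")
    out["city"], out["state"] = city, state
    # Bin a numeric value into a new attribute.
    out["age_group"] = "under 40" if out["age"] < 40 else "40 and over"
    return out

transformed = [transform(r) for r in records]
```

Each step either overwrites an existing attribute (`age`, `name`), creates new ones (`city`, `state`, `age_group`), or removes one (`city_state`), which mirrors the three outcomes a transformation can have.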
Most transformations are available directly as specific options in the
Transform page of Studio.
You can use the Groovy scripting language and a list of custom,
predefined Groovy-based transform functions available in Big Data Discovery
to create a transformation script. Transformation scripts are
collections of various transformations; they can contain any of the
transform functions.
You can also write your own transformations from scratch in Groovy,
using the Transformation Editor in the same Transform page of Studio.
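
The idea of a transformation script as a collection of transformations can be sketched as an ordered list of steps applied one after another. The following is plain Python, not Groovy and not the Big Data Discovery transform functions; the step names (`trim_name`, `upper_country`, `drop_notes`) and the `run_script` helper are hypothetical.

```python
from functools import reduce

# Hypothetical steps; each takes a record and returns a new record.
def trim_name(record):
    return {**record, "name": record["name"].strip()}

def upper_country(record):
    return {**record, "country": record["country"].upper()}

def drop_notes(record):
    return {k: v for k, v in record.items() if k != "notes"}

# A "script" here is simply an ordered collection of steps.
script = [trim_name, upper_country, drop_notes]

def run_script(script, record):
    # Feed the output of each step into the next one.
    return reduce(lambda rec, step: step(rec), script, record)

result = run_script(script, {"name": "  Ada ", "country": "uk", "notes": "n/a"})
```

Here `result` holds the trimmed name and upper-cased country, and the `notes` attribute is gone. Keeping the steps in a list makes the order of application explicit.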
When you commit a transformation script to a project, the script runs
against the project's data sample but does not affect the data set in the
Catalog. You can either apply the transformation script
to your current project, or create a new data set using the transformation
script:
- When you commit the transformation script to the project, no new entry is
created in the Catalog, but the current project does show the effects of the
transformation script.
- When you create a new data set using the transformation script, a new data
set entry is added to the Catalog for use by other projects. That new data
set is a new sample of the original source Hive table after the transformation
script is applied. Creating a new data set in this way does not apply the
transformation script to the current project.
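
The difference between the two options can be modeled with a small sketch. This is plain Python under stated assumptions, not a Studio or Catalog API: `source_table` stands in for the source Hive table, `take_sample` is a placeholder that copies the first few rows rather than a real sampling method, and the order of sampling and transforming inside `create_new_data_set` is also an assumption.

```python
# Hypothetical stand-ins: none of these names are Studio or Catalog APIs.
source_table = [{"id": i, "name": f"item {i}"} for i in range(1, 7)]

def take_sample(rows, size=3):
    # Placeholder for sampling: simply copies the first `size` rows.
    return [dict(row) for row in rows[:size]]

def script(row):
    # A one-step transformation script: upper-case the name attribute.
    return {**row, "name": row["name"].upper()}

def commit_to_project(project, script):
    # Transforms the project's own sample; the catalog is not touched.
    project["data"] = [script(row) for row in project["data"]]

def create_new_data_set(catalog, name, source_rows, script):
    # Adds a catalog entry built from a fresh sample of the source.
    # No project is passed in, so no project can change.
    catalog[name] = [script(row) for row in take_sample(source_rows)]

catalog = {"items": take_sample(source_table)}
project_a = {"data": take_sample(source_table)}
project_b = {"data": take_sample(source_table)}

commit_to_project(project_a, script)                               # option 1
create_new_data_set(catalog, "items_upper", source_table, script)  # option 2
```

After these calls, `project_a` shows the transformed names while the original `items` entry is unchanged, and the catalog gains an `items_upper` entry while `project_b` still holds untransformed data.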