About transformations and transformation scripts

Transformations are changes that you make to your project data set after the source data has been processed and loaded into Studio. In effect, transformations are an ETL process for cleaning your data. A transformation can overwrite an existing attribute, modify attributes, or create new attributes. You can combine any number of transformations into a transformation script, which you then run against the project data set. If desired, you can also create a new data set in the Catalog based on the transformed version.

The Transform page allows you to make corrections and enhancements to a project data set. The transformations are within the scope of the project and do not affect the data set in the Catalog for other Studio users.

For example, you can use the Transform page to:
  • Change an attribute's data type
  • Change capitalization of values
  • Remove attributes or records
  • Populate missing values
  • Split columns into new ones (by creating new attributes)
  • Add or remove attributes, or overwrite existing attributes
  • Correct invalid or inconsistent values. For example, an attribute may have multiple versions of the same value (Wal-Mart, Walmart, Wal*Mart).
  • Group or bin values
  • Extract information from values
You can also provide additional information about the data or provide new ways to use the values. For example, you can:
  • Identify commonly used terms
  • Analyze sentiment
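
A few of the clean-up operations listed above (correcting inconsistent values, binning values, and creating a new attribute) can be sketched as follows. This is a minimal illustration in plain Python, not Studio's Groovy transform functions; the record layout, the attribute names (retailer, price, price_bin), the bin threshold, and the helper functions are all hypothetical.

```python
import re

# Hypothetical sample records; the attribute names are illustrative only.
records = [
    {"retailer": "Wal-Mart", "price": 4.5},
    {"retailer": "walmart", "price": 19.0},
    {"retailer": "Wal*Mart", "price": None},
]

def normalize_retailer(value):
    """Collapse spelling variants such as Wal-Mart and Wal*Mart into one value."""
    key = re.sub(r"[^a-z0-9]", "", value.lower())  # ignore case and punctuation
    return "Walmart" if key == "walmart" else value

def price_bin(price):
    """Group a numeric price into a coarse, named bin (threshold is arbitrary)."""
    if price is None:
        return "unknown"
    return "low" if price < 10 else "high"

def transform(record):
    """Return a transformed copy; the input record is left unchanged."""
    out = dict(record)
    out["retailer"] = normalize_retailer(record["retailer"])  # overwrite an attribute
    out["price_bin"] = price_bin(record["price"])             # create a new attribute
    return out

cleaned = [transform(r) for r in records]
```

In this sketch, all three spelling variants end up as the single value Walmart, and each record gains a price_bin attribute while the original records stay as they were.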

Custom transforms in a transformation script

You can also create a transformation script by using the Groovy scripting language together with the predefined, Groovy-based transform functions available in Studio. A transformation script is a collection of transformations and can contain any of the transform functions.

You can also write your own transformations from scratch in Groovy, using the Transformation Editor on the same Transform page of Studio.

Running a transformation script

You run a transformation script by clicking Commit to project in the Transform editor. When you commit a transformation script to a project, the script runs against the project data set and applies each transform step in the script to it. However, these operations do not affect the public data set in the Catalog on which the project is based.
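
The behavior described above, in which an ordered script of transform steps changes the project data set but leaves the Catalog data set alone, can be modeled roughly as follows. This is a conceptual sketch in plain Python, not how Studio is implemented; the function names, the step functions, and the sample records are assumptions made for illustration.

```python
import copy

def uppercase_city(record):
    """Transform step: overwrite the existing city attribute."""
    record["city"] = record["city"].upper()

def add_name_length(record):
    """Transform step: create a new attribute from an existing one."""
    record["name_length"] = len(record["name"])

def commit_script(project_data_set, steps):
    """Apply every step, in order, to every record of the project data set."""
    for step in steps:
        for record in project_data_set:
            step(record)

# Hypothetical Catalog data set shared with other users.
catalog_data_set = [
    {"name": "Ann", "city": "Boston"},
    {"name": "Raj", "city": "Austin"},
]

# The project works on its own copy, so commits never touch the Catalog data.
project_data_set = copy.deepcopy(catalog_data_set)

script = [uppercase_city, add_name_length]
commit_script(project_data_set, script)
```

After the commit, the project copy holds the transformed records, while catalog_data_set still contains the original values.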

You can preview the result of changes before you commit the script. While the script is running, you can check its progress in the Notifications panel.

After you transform a project data set, you can optionally create a new data set based on your modifications. This allows other Studio users to access the new data set in the Catalog.

Size of the data set to transform

If the source data contains more than one million records, it is automatically sampled during the ingest process, so that the resulting data set contains at most one million records.

If the source data contains one million records or fewer, the project contains the full data set rather than a sampled data set.
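
The size rule can be expressed as a short sketch. The sampling method that Studio uses is not described here, so the simple random sample below is an assumption, as are the function name, the seed parameter, and the treatment of a source with exactly one million records as unsampled.

```python
import random

MAX_RECORDS = 1_000_000  # limit stated in the text above

def ingest(source_records, max_records=MAX_RECORDS, seed=0):
    """Return the full data set, or a sample capped at max_records.

    Assumption: simple random sampling without replacement. The actual
    sampling strategy used during ingest is not specified in this topic.
    """
    if len(source_records) <= max_records:
        return list(source_records)  # small source: keep every record
    rng = random.Random(seed)        # seeded so the sketch is repeatable
    return rng.sample(source_records, max_records)

# Usage with a tiny limit so the effect is easy to see.
small = ingest(list(range(3)), max_records=4)    # under the limit: full data
sampled = ingest(list(range(10)), max_records=4) # over the limit: 4 records
```

Here small contains all three source records, and sampled contains four distinct records drawn from the ten available.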