Creating a new data set from the transformed data

After running a transformation script, you can create a new data set in the Catalog based on the transformed project data.

When you create a new data set, Studio runs the transformation script against the entire data set, not just the data sample in the project, and applies the transformations and enrichments to the new data set. Studio then generates a new sample, and the data set becomes available in the Catalog.

Note:

Due to the way BDD converts Hive source table data types to its own data types, applying your script to the source data set may result in some omitted or modified data types. For example, some complex Hive data types that do not match the Dgraph data types are omitted. For more information, see Data type conversion.
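
As a rough illustration of the note above, the following Python sketch splits the columns of a Hive table into those that would be kept and those that would be omitted. The set of omitted types (OMITTED_HIVE_TYPES), the helper name split_columns, and the sample columns are assumptions made for this example. They are not the actual BDD conversion rules; see Data type conversion for the authoritative list.

```python
# Illustrative only: this set is an assumption for the example,
# not the authoritative list of Hive types that BDD omits.
OMITTED_HIVE_TYPES = {"map", "struct", "uniontype"}


def split_columns(columns):
    """Split (name, hive_type) pairs into kept and omitted column names."""
    kept, omitted = [], []
    for name, hive_type in columns:
        # Reduce "map<string,int>" or "decimal(10,2)" to the base type name.
        base = hive_type.lower().split("<")[0].split("(")[0].strip()
        (omitted if base in OMITTED_HIVE_TYPES else kept).append(name)
    return kept, omitted


# Hypothetical source table columns.
example = [
    ("id", "bigint"),
    ("price", "decimal(10,2)"),
    ("tags", "array<string>"),
    ("attrs", "map<string,string>"),
]
kept, omitted = split_columns(example)
print("kept:", kept)
print("omitted:", omitted)
```

Running a check like this against the source table's schema before creating the data set can show in advance which columns might not appear in the new data set.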

To create a new data set from the transformed data:

  1. Select Transform.
  2. From the transformation script menu, select Create a Data Set.
    (Screenshot: the Create a Data Set option in Studio.)

  3. In the Data Set Name field, type the name of the new data set.
  4. If desired, provide any notes about the new data set in the Description field.
  5. Optionally, expand Advanced Hadoop Options and specify a Hive table name. By default, the Hive table name is the same as the data set name. If you create a data set with the same name as an existing data set, you must specify a different Hive table name. Studio maps the data set name to a unique table name.
  6. Click Create Data Set.
    (You can check the progress of this process in the Oozie Web UI.)

After BDD finishes creating the new table in the Hive database and processing the data, the data set becomes available in the Catalog. If you do not see the new data set in the Catalog, the script failed. To learn why it failed, check the Data Processing logs. For more information, see Transform logging.
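
Progress can also be checked from a script instead of the Oozie Web UI. The sketch below is one possible approach, assuming the Oozie web services API is reachable at a site-specific URL (the host and port in the comment are placeholders). The workflow names in the sample data are made up, because this page does not state the name under which the job appears in Oozie.

```python
import json
from urllib.request import urlopen


def fetch_workflows(oozie_url, limit=20):
    """Return recent workflow jobs from the Oozie web services API."""
    # oozie_url is site-specific, for example "http://oozie-host:11000/oozie".
    with urlopen(f"{oozie_url}/v1/jobs?jobtype=wf&len={limit}") as response:
        return json.load(response).get("workflows", [])


def unfinished_or_failed(workflows):
    """Group workflow names by status, skipping jobs that succeeded."""
    grouped = {}
    for wf in workflows:
        status = wf.get("status", "UNKNOWN")
        if status != "SUCCEEDED":
            grouped.setdefault(status, []).append(wf.get("appName", wf.get("id")))
    return grouped


# Sample data shaped like an Oozie jobs response; the names are made up.
sample = [
    {"id": "0000001-oozie-W", "appName": "example-job-a", "status": "SUCCEEDED"},
    {"id": "0000002-oozie-W", "appName": "example-job-b", "status": "RUNNING"},
    {"id": "0000003-oozie-W", "appName": "example-job-c", "status": "KILLED"},
]
print(unfinished_or_failed(sample))
```

In practice, you would pass the result of fetch_workflows to unfinished_or_failed. A job that shows as KILLED or FAILED is a cue to look in the Data Processing logs.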