Creating a new Hive table with the transformation script

When you use Create a Data Set in the Transformation Editor, your transformation script is applied to the source Hive table your project data set was created from. This operation creates a new Hive table in the Dgraph index and adds a new data set to the Catalog.

Note: Due to the way BDD converts Hive source table data types to its own data types, applying your script to the source table may result in some omitted or changed data types. For example, some complex Hive data types that do not match the Dgraph data types are omitted. For more information, see Data type conversion.

To create a new data set:

  1. Click the menu icon in the transformation script panel and select Create a Data Set.
    The Create a Data Set dialog box opens.
  2. In the New Hive Table Name field, enter a unique name for the new Hive table.
    The name can contain only alphanumeric characters and underscores.
  3. In the New Hive Table Data Directory field, enter the location in HDFS where you want your table to be stored.
  4. In the New Data Set Name field, enter a unique name for the new data set.
    This is the name the new data set will have in Catalog. The name you choose can be different from the Hive table's name.
  5. Optionally, enter information about your transformation script or new data set in the Comments field.
    Your comments are stored in the new table's metadata, along with the transformation script and the date the table was created.
  6. Click Save.
    A dialog box appears indicating that the transformation is in progress and may take several hours to complete.
If the script is successful, the new Hive table will be added to the index and the new data set will appear in Catalog.

If the new data set does not appear in Catalog after the transformation finishes, the script failed. To find out why it failed, check the Data Processing logs. For more information, see Transform logging.
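
The naming rule in step 2 can be checked before you open the dialog box. The following is a minimal sketch in Python, not part of Big Data Discovery: the function name is hypothetical, and it assumes that "alphanumeric" means ASCII letters and digits.

```python
import re

# Assumed rule, taken from step 2 of the procedure:
# ASCII letters, digits, and underscores only, at least one character.
_TABLE_NAME_PATTERN = re.compile(r"[A-Za-z0-9_]+")


def is_valid_table_name(name: str) -> bool:
    """Return True if the whole name matches the assumed naming rule."""
    return _TABLE_NAME_PATTERN.fullmatch(name) is not None
```

For example, is_valid_table_name("sales_2015_clean") is accepted, while a name containing a hyphen or a space is rejected. The check covers only the character rule; it cannot tell you whether the name is unique in Hive.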

When you apply your transformation script to the source Hive table, Data Processing in Big Data Discovery does the following:
  1. Obtains the transformation script from Studio.
  2. Retrieves the schema of the transformed project data set from the Dgraph.
  3. Creates a new Hive table (called HT2 in this example), using the project data set's schema.
  4. Loads the data row by row from the original source Hive table (called HT1 in this example) into HT2, running the transformation script on each row as it is loaded and saving the transformed rows in HT2.
  5. Samples the HT2 Hive table (this is the new Hive table with the transformed data) and adds the resulting data set to the Catalog.
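
The row-by-row flow in steps 3 through 5 can be pictured with a small in-memory sketch in Python. It is an illustration only, not how Data Processing is implemented: tables are modeled as lists of dictionaries, every function name is hypothetical, and the simple random sample is an assumption about the sampling step.

```python
import random
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]


def apply_script(source_rows: Iterable[Row],
                 transform: Callable[[Row], Row]) -> List[Row]:
    """Steps 3-4: build HT2 by transforming each HT1 row as it is loaded."""
    ht2: List[Row] = []
    for row in source_rows:
        # Transform a copy so the source table (HT1) is left unchanged.
        ht2.append(transform(dict(row)))
    return ht2


def sample_table(rows: List[Row], sample_size: int, seed: int = 0) -> List[Row]:
    """Step 5: draw a sample of HT2 (assumed here to be a simple random sample)."""
    if len(rows) <= sample_size:
        return list(rows)
    return random.Random(seed).sample(rows, sample_size)


def upper_city(row: Row) -> Row:
    """Hypothetical transformation script: trim and upper-case the 'city' column."""
    row["city"] = str(row["city"]).strip().upper()
    return row


ht1 = [{"id": 1, "city": " boston "}, {"id": 2, "city": "austin"}]
ht2 = apply_script(ht1, upper_city)
catalog_sample = sample_table(ht2, sample_size=1)
```

In this sketch HT1 is never modified, which mirrors the procedure above: the transformed data is written to a new table, and the data set added to the Catalog is a sample of that new table.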