Creating Transforms

Create transforms to prepare and enrich data. You create a transform based on sample data, and after editing and publishing it, you can apply the transform to an entire data set in a cluster.

To create a transform:

  1. On the Home or Catalog page, click Create Transform.
    The Create Transform page appears.
  2. In the Name field, enter a name to identify the transform. Use only alphanumeric characters and underscores; other special characters are not allowed.
  3. In the Description field, describe the use of this transform.
  4. In the Source field, click Select.
    The Select dialog box appears.
    To search for a source, enter part or all of its name in the Search field.
  5. From the Source drop-down list, click the source where your sample or raw data file is located.
    A list of directories for the selected source appears in the Select File dialog.
  6. In the Select File dialog, go to the directory where your sample or raw data file is located, select the file, and then click OK.
    For more information on supported file types, see Understanding the Supported File Types.
    You can also select a directory instead of a specific file. In that case you receive a warning, but all the files in that directory are processed as part of the transform.
    Description of the illustration transform_directory_warning.png
  7. Optionally, select any of the following:
    • Smart Sample: Allow the processing engine to use a sampling algorithm on the selected file instead of loading and processing the entire set of rows in the source. This shortens the time for data preparation.

      Note:

      Smart samples are loaded only for files that contain fewer than one million rows. Otherwise, the entire data file is automatically loaded into your Hadoop cluster.
    • Contains Headers: This option is selected by default. If your selected source doesn’t contain headers, then deselect this option.
  8. Click Submit.
    You return to the Catalog page, where your new transform is listed and begins processing.

    On the right side of the Catalog page, the Activity Stream shows the status of data ingestion and profiling for your new transform.
    Description of the illustration activitystream.png

    When the transform is successfully processed, the status changes in the Catalog list.


    Description of the illustration transform_ready.png
  9. To open the transform and view its contents, click the name of the transform or select Edit from the More Actions menu.
    The main authoring page appears. The transform is created using patterns that the system automatically recognizes. The system also displays recommendations to fix and enrich the data. For more information, see Understanding Recognized Patterns and Data Enrichments.
  10. Edit the transform script.
    For more information on editing the transform script, see Task Overview for Authoring the Transform Script.
  11. Click Done.
    The changes that you made to the transform are saved.
The created transform is now part of the Catalog. Use it to prepare and enrich data.
Publish your transform and apply it to other sources. For more information, see Publishing Transforms.
Schedule your transform to run periodically on one or more sources. For more information, see Understanding Policies and Scheduling.
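
The naming rule in step 2 (only alphanumeric characters and underscores) can be checked before you enter a name. The following is a minimal Python sketch, not part of the product: the function name `is_valid_transform_name` is hypothetical, and the pattern assumes ASCII letters and digits with no length limit. The service's own validation may differ.

```python
import re

# Assumption: "alphanumeric" means ASCII letters and digits; the product
# may apply additional rules (for example, a length limit) not modeled here.
_NAME_PATTERN = re.compile(r"[A-Za-z0-9_]+")


def is_valid_transform_name(name: str) -> bool:
    """Return True if name contains only letters, digits, and underscores."""
    # fullmatch requires the whole string to match, so an empty name
    # or a name containing any other character is rejected.
    return _NAME_PATTERN.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("sales_2024", "sales-2024", "customer data", ""):
        print(repr(candidate), is_valid_transform_name(candidate))
```

Names such as `sales_2024` pass this check, while names containing hyphens, spaces, or other special characters do not.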