Loading the full data set in a project

After you add a data set to a project, you can load the full data set into the project. This is useful for comprehensive data analysis and for building a BDD application. Without a full data load, Studio displays a sample of approximately 1 million records whenever the full data set is larger than 1 million records.

The Load Full Data Set option behaves as follows:
  • It loads all records stored in the Hive table for a data set. This includes any table updates performed by a system administrator. The Load Full Data Set action is available only for the initial full data load. After the first full data load, the action changes to Reload Data Set, and you can reload the data set any number of times.
  • It expands a sampled data set to the full size of the data set.
  • If the project contains a transformation script that you have committed, then Studio runs that script against the full data set. This way, all transformations apply to the full data set in the project.
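
The behavior described above can be summarized in a small conceptual model. This is only an illustrative sketch, not a Studio or BDD API: the class ProjectDataSet, its fields, and the exact SAMPLE_LIMIT value are hypothetical names chosen for this example, and the sketch assumes that a reload picks up whatever is in the Hive table at that moment. It does not model the transformation script step.

```python
from dataclasses import dataclass, field

# Approximate sample size mentioned in the text (assumed exact here for simplicity).
SAMPLE_LIMIT = 1_000_000


@dataclass
class ProjectDataSet:
    """Toy model of a data set inside a project (hypothetical, not a Studio API)."""

    hive_record_count: int                       # records currently in the Hive table
    loaded_record_count: int = field(init=False)  # records visible in the project
    fully_loaded: bool = False                   # True after the first full load

    def __post_init__(self) -> None:
        # Before a full load, a large data set appears in the project as a sample.
        self.loaded_record_count = min(self.hive_record_count, SAMPLE_LIMIT)

    @property
    def available_action(self) -> str:
        # The menu action changes name after the first full load.
        return "Reload Data Set" if self.fully_loaded else "Load Full Data Set"

    def load_or_reload(self) -> None:
        # Both actions bring in all records currently stored in the Hive table.
        self.loaded_record_count = self.hive_record_count
        self.fully_loaded = True


ds = ProjectDataSet(hive_record_count=5_000_000)
print(ds.available_action, ds.loaded_record_count)  # sampled state
ds.load_or_reload()
print(ds.available_action, ds.loaded_record_count)  # fully loaded state
```

In this model, the first call to load_or_reload corresponds to Load Full Data Set, and every later call corresponds to Reload Data Set.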

The following diagram shows the workflow of loading a full data set into a project:


[Diagram: the data lifecycle, including the load full step.]

In this workflow, the following actions take place:
  1. You load a data set from a file or JDBC data source. This is the initial load of the data set into the Catalog.
  2. You can then explore the data and add it to a project to use Transform and Discover.
  3. You load the full data set and reload the data set as necessary.

Note that loading the full data set affects only the data set in a specific project. It does not affect the data set as it appears in the Catalog.

To check whether a data set has already been fully loaded into a project, go to the Data Set Manager page and check whether the Record Data Volume property indicates Full data set is loaded.

To load the full data set in a project:

  1. From the Configuration Options menu, select Project Settings.
  2. Select Data Set Manager and expand the options next to the data set name.
  3. Select Load Full Data Set.
  4. In the confirmation dialog, select Load Full Data Set again.
  5. Return to Explore or Transform to monitor the progress of the load operation.

Depending on the size of the data set, the load may take some time to complete. After the operation finishes, check that the Data Volume information indicates Full data set is loaded and that the Explore header shows the full record count.