Data loading and sample size

You can load either a sample or a full data set. If you load a sample, you can move to a full data set later. This topic summarizes the options for getting from a sample to a full data set.

These options are high-level summaries only. For detailed steps, see the referenced documentation for each option.
  • Controlling sample size when loading from Data Processing CLI. DP CLI has a parameter for the sample size. The default sample size is 1 million records. When you use DP CLI for data loading, you can customize this parameter:
    • If it is greater than the number of source records in Hive, a full data set is loaded. In this case, you have already loaded a full data set. The Data Volume field in the Data Set Manager in Studio indicates that the full data set is loaded.

    • If it is less than the number of source records in Hive, a sampled data set is loaded, based on the sample size you specify. In this case, you can run DP CLI with the incremental update flag, or use Load Full Data Set in Studio, to load the entire source data set from Hive. You will then have a full data set in BDD.

    For detailed information on specifying the sample size with DP CLI, see the Data Processing Guide.

  • Controlling data set size when loading from a file or a JDBC source.

    If you load a data set from a personal file or import it from a JDBC source, then all data is loaded. However, the loaded data may still be only a sample of a larger source data set that you have elsewhere on your system.

    If you later want to add full data from the source, locate the Hive table that BDD created when you loaded the file. Next, drop that table (for example, from Hue) and replace it with a production Hive table. You can then run Load Full Data Set on this table in Studio. This loads a full data set.

    This process is known as creating a BDD application. For detailed steps on this procedure, see the topic on creating a BDD application in the Studio User's Guide.
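
The sample-size rule for DP CLI described above can be sketched as a small check. This is an illustration only and not part of DP CLI or Studio: the function name expected_data_volume and its return values are hypothetical, and treating a sample size equal to the record count as a full load is an assumption.

```python
# Illustrative sketch only; these names are hypothetical and are not a BDD API.
DEFAULT_SAMPLE_SIZE = 1_000_000  # default sample size stated in this topic


def expected_data_volume(hive_record_count: int,
                         sample_size: int = DEFAULT_SAMPLE_SIZE) -> str:
    """Return "full" or "sample" for a load with the given sample size."""
    if hive_record_count < 0:
        raise ValueError("hive_record_count must not be negative")
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    # Assumption: a sample size equal to the record count loads every record.
    if sample_size >= hive_record_count:
        return "full"
    return "sample"


if __name__ == "__main__":
    # A 5-million-record Hive table loaded with the default sample size.
    print(expected_data_volume(5_000_000))
    # The same table loaded with a sample size of 10 million.
    print(expected_data_volume(5_000_000, sample_size=10_000_000))
```

A result of "sample" corresponds to the case in which you would follow up with an incremental update or Load Full Data Set to bring in the remaining records.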