Sampled and full data sets

A data set in BDD is either a sample of its source data or the full source data set.

Sampled data sets

A sample data set in BDD represents a random sample of a source data set in Hive. If the data set originates in Hive, you use the Data Processing CLI (DP CLI) to load it. By default, the DP CLI uses a sample size of 1 million records. You can specify a different sample size at data loading time.

  • If you specify a sample size that is less than the size of the source data set in Hive, a sample data set is loaded into BDD.
  • If you specify a sample size that is greater than or equal to the size of the source data set, a full data set is loaded.
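The rule in the two bullets above can be sketched as a small Python function. This is only an illustration of the comparison, not DP CLI code: the function name, parameter names, and return strings are hypothetical, and the only values taken from this topic are the 1 million record default and the "less than" versus "greater than or equal to" comparison.

```python
# Illustrative sketch only: models the sample-size rule described above.
# Nothing here is part of the DP CLI; all names are hypothetical.

DEFAULT_SAMPLE_SIZE = 1_000_000  # default sample size stated in this topic


def loaded_data_set_kind(source_record_count: int,
                         sample_size: int = DEFAULT_SAMPLE_SIZE) -> str:
    """Return "sample" or "full" for a given source size and sample size."""
    if source_record_count < 0:
        raise ValueError("source_record_count must not be negative")
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")

    # Sample size smaller than the source: a sample data set is loaded.
    if sample_size < source_record_count:
        return "sample"
    # Sample size greater than or equal to the source: all records are loaded.
    return "full"


if __name__ == "__main__":
    print(loaded_data_set_kind(5_000_000))                      # default sample size
    print(loaded_data_set_kind(250_000))                        # small source table
    print(loaded_data_set_kind(5_000_000, sample_size=5_000_000))
```

Under these assumptions, a source table with 5 million records and the default sample size yields a sample data set, while a table with 250,000 records is loaded in full.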

Full data sets

A full data set in BDD contains all of the records in the source it was loaded from. For example, if a data set originates in Hive, and the sample size in DP CLI is greater than or equal to the record count in the source Hive table, this data set is loaded in full.

For a summary of how to get from a sample to a full data set, see Data loading and sample size.

For more information on sampling and data set loading during data processing, see the Data Processing Guide.

For information on adding and managing data sets in Studio, including loading a full data set, see the Studio User's Guide.