A data set in BDD is either a sample of its source data or a full copy of it.
Sampled data sets
A sampled data set in BDD represents a random sample of a source data set in Hive. If the data set originates in Hive, you use the Data Processing CLI (DP CLI) to load it. By default, the DP CLI uses a sample size of 1 million records. You can specify a different sample size at data loading time.
Full data sets
A full data set in BDD contains all of the records in the source it was loaded from. For example, if a data set originates in Hive and the sample size specified in the DP CLI is greater than the record count of the source Hive table, the data set is loaded in full.
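The rule in the example above can be sketched in a few lines of Python. This is only an illustration of the comparison described in this topic, not DP CLI code: the function name classify_load and its parameters are hypothetical, and the only values taken from this topic are the default sample size of 1 million records and the "greater than" comparison. This topic does not say what happens when the sample size exactly equals the record count, so the sketch leaves that case as "sampled".

```python
# Default sample size stated in this topic (1 million records).
DEFAULT_SAMPLE_SIZE = 1_000_000


def classify_load(source_record_count: int,
                  sample_size: int = DEFAULT_SAMPLE_SIZE) -> str:
    """Return "full" or "sampled" for a Hive-sourced data set.

    Hypothetical helper: it mirrors the rule described in the text,
    not the actual DP CLI implementation.
    """
    if source_record_count < 0:
        raise ValueError("source_record_count must not be negative")
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")

    # A sample size greater than the source record count means every
    # record is loaded, so the data set is loaded in full.
    if sample_size > source_record_count:
        return "full"

    # Otherwise the data set is a random sample of the source.
    # (The equal case is an assumption; the text does not cover it.)
    return "sampled"


if __name__ == "__main__":
    print(classify_load(250_000))                # small table, default size
    print(classify_load(5_000_000))              # large table, default size
    print(classify_load(5_000_000, 10_000_000))  # large table, raised size
```

Under these assumptions, a 250,000-record Hive table is loaded in full with the default sample size, while a 5,000,000-record table is sampled unless the sample size is raised above 5,000,000.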
For a summary of how to get from a sample to a full data set, see Data loading and sample size.
For more information on sampling and data set loading during data processing, see the Data Processing Guide.
For information on adding and managing data sets in Studio, including loading a full data set, see the Studio User's Guide.