About data sets

Data sets in Big Data Discovery (BDD) are called BDD data sets. The name distinguishes them from the source data sets stored in Hive tables.

A BDD data set is the central concept in the product architecture. Data sets in BDD originate from these sources:
  • Many data sets are loaded by the Data Processing workflow for loading data. This workflow runs when you launch the DP CLI after installing Big Data Discovery, and it adds the data sets to Studio's Catalog.
  • Other data sets appear in Studio because you load them from a personal file or a JDBC data source.
  • Also, you can create a new BDD data set by transforming an existing data set, or by exporting a data set to HDFS as a new Hive table. Such data sets are called derived data sets.
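
The origins listed above can be summarized in a small conceptual model. The following Python sketch is illustrative only: the names Origin, DataSet, and is_derived, and the sample data set names, are hypothetical and are not part of any BDD API. It only shows which origins produce a derived data set.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Origin(Enum):
    """Ways a BDD data set can originate (hypothetical names, for illustration)."""
    DP_WORKFLOW = "loaded by the Data Processing workflow (DP CLI)"
    PERSONAL_FILE = "loaded from a personal file"
    JDBC = "loaded from a JDBC data source"
    TRANSFORMED = "created by transforming an existing data set"
    EXPORTED = "created by exporting a data set to HDFS as a new Hive table"


# Only data sets created from an existing data set count as derived.
DERIVED_ORIGINS = {Origin.TRANSFORMED, Origin.EXPORTED}


@dataclass(frozen=True)
class DataSet:
    name: str
    origin: Origin
    parent: Optional[str] = None  # name of the data set it was derived from, if any

    @property
    def is_derived(self) -> bool:
        return self.origin in DERIVED_ORIGINS


# Sample data set names are made up for this sketch.
sales = DataSet("sales", Origin.DP_WORKFLOW)
sales_clean = DataSet("sales_clean", Origin.TRANSFORMED, parent="sales")

for ds in (sales, sales_clean):
    kind = "derived" if ds.is_derived else "not derived"
    print(f"{ds.name}: {ds.origin.value} ({kind})")
```

In this model, a data set loaded by the DP CLI, from a personal file, or from a JDBC source is not derived, while a transformed or exported data set is derived and records the data set it came from.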

When the Data Processing component runs data loading, or when you add a data set by uploading a file or by loading data from a JDBC source, the resulting data set appears in Studio's Catalog. You can use Preview to see the details of each data set in the Catalog.

You can also use Studio's Data Set Manager to see information about all data sets in a project. For each data set in a project, you can see its record count, when it was added, whether it is private to you, and other details.