About joining data sets

Studio can perform a left outer join, inner join, or full outer join against one or more data sets based on a key, or compound key, that you create.

The results of the join appear in the primary data set. The secondary data set is not modified.

The updated records have an additional attribute, named Data Source Name, that specifies which data sources, by name, contributed to each record. For example, in a left join, every record is tagged with dataset1, but some records might have no matching data in dataset2. Those records contain only the value dataset1. All other records contain both dataset1 and dataset2 as attribute values.
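The tagging behavior described above can be sketched in a few lines of Python. This is only an illustration of the left-join case, not Studio's implementation: the function name left_join, the sample records, and the choice to store Data Source Name as a list are all assumptions made for the example.

```python
def left_join(primary, secondary, key,
              primary_name="dataset1", secondary_name="dataset2"):
    """Left outer join of two lists of dicts on a key or compound key.

    Hypothetical helper for illustration; `key` is a list of attribute names.
    """
    # Index the secondary records by their key values for fast lookup.
    index = {}
    for rec in secondary:
        index.setdefault(tuple(rec[k] for k in key), []).append(rec)

    joined = []
    for rec in primary:
        matches = index.get(tuple(rec[k] for k in key))
        if matches:
            # Matched: both data sets contributed to the new record.
            for match in matches:
                joined.append({**match, **rec,
                               "Data Source Name": [primary_name, secondary_name]})
        else:
            # Unmatched: only the primary data set contributed.
            joined.append({**rec, "Data Source Name": [primary_name]})
    return joined


# Hypothetical sample data.
dataset1 = [{"id": 1, "city": "Austin"}, {"id": 2, "city": "Boston"}]
dataset2 = [{"id": 1, "sales": 100}]

result = left_join(dataset1, dataset2, key=["id"])
for record in result:
    print(record)
```

The record with id 1 carries both source names, while the record with id 2 has no match in dataset2 and is tagged with dataset1 only. The inputs are left untouched, which mirrors the secondary data set not being modified.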

Differences between joining and linking data sets

Joining data sets performs a complete SQL join that combines records from the primary and secondary sides of the join. The join materializes new records that replace the data in the primary data set of your project.

Linking data sets links the data at query time for temporary use in Discover components. Both data sets continue to be stored as separate data sets with only a key (link) that connects them during queries. Linked data sets are not permanently joined in any way.

A link is logically similar to a database view. Links provide a way to temporarily look at relationships across multiple data sets without the level of persistence and data processing required by a join. For more information about linking, see Linking Project Data Sets.
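The database analogy can be made concrete with SQLite from the Python standard library. This is a sketch of the general difference between a materialized join and a view, not a description of how Studio stores or queries data. The table names, column names, and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER);
    CREATE TABLE customers (customer_id INTEGER, name TEXT);
    INSERT INTO orders VALUES (1, 10), (2, 20);
    INSERT INTO customers VALUES (10, 'Ada');

    -- Join analogue: the combined records are computed once and stored.
    CREATE TABLE orders_joined AS
        SELECT o.order_id, o.customer_id, c.name
        FROM orders o LEFT JOIN customers c ON o.customer_id = c.customer_id;

    -- Link analogue: a view that is evaluated each time it is queried.
    CREATE VIEW orders_linked AS
        SELECT o.order_id, o.customer_id, c.name
        FROM orders o LEFT JOIN customers c ON o.customer_id = c.customer_id;
""")

# The secondary data changes after the join was materialized.
conn.execute("INSERT INTO customers VALUES (20, 'Grace')")

query = "SELECT order_id, name FROM {} ORDER BY order_id"
joined_rows = conn.execute(query.format("orders_joined")).fetchall()
linked_rows = conn.execute(query.format("orders_linked")).fetchall()

print(joined_rows)  # stored result, unchanged until the join is re-run
print(linked_rows)  # computed at query time, so it sees the new customer
```

The stored table keeps the result it had when the join ran, while the view reflects the current contents of both tables on every query.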

Updates to joined data

Joins have the same update model as all other transforms. If you load a full data set or run an incremental update, Studio notifies you that the data set has changed, and you can accept or reject the update in your project. If you accept an update, Studio reads the changes from the Hive table and re-runs the join operation.

Limits to the sample size after joins

A join operation can produce a larger data set. However, the bdd.sampleSize setting on the Data Processing Settings page limits the sample size that results from transforms such as Join, Aggregate, and FilterRows. In other words, a join operation never results in a sample size that exceeds the value of bdd.sampleSize.
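The effect of the limit can be sketched as a cap applied to the output of a transform. This topic does not state how Studio chooses which records to keep, so the random selection below is an assumption, and cap_sample is a hypothetical helper rather than a Studio function.

```python
import random


def cap_sample(records, sample_size, seed=0):
    """Return at most `sample_size` records (hypothetical helper).

    Assumption: records are chosen at random when the input is too large.
    """
    if len(records) <= sample_size:
        # Already within the limit, so every record is kept.
        return list(records)
    return random.Random(seed).sample(records, sample_size)


# A join result with 10 records, capped by a sample size of 4.
joined = [{"id": i} for i in range(10)]
capped = cap_sample(joined, sample_size=4)
print(len(capped))
```

A result that is already smaller than the limit passes through unchanged, so the cap only matters when a join expands the data beyond the configured sample size.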