Converting a project to a BDD Application

A BDD Application is a logical designation of a project that contains one or more data sets that you transform, update, and maintain for long-lasting data analysis, reporting, and sharing. In this release, you can think of an application as a production-quality project. This topic describes the high-level workflow of converting a project into a BDD application and provides pointers to more detailed instructions as necessary.

Before you convert a project to an application, there are several tasks that you must already have performed. These tasks are common to any project and are not specific to building a BDD Application:
  1. Create a data set. You can do this by uploading your source data in Studio or by creating a Hive table and running a data processing workflow.
  2. Add one or more data sets to a project. For details, see Adding a data set to an existing project.
  3. Set access permissions on data sets and projects. For details, see Managing Access to Projects.
  4. If necessary, transform the project data set to clean up the attributes and attribute values. This might include reformatting attributes, changing data types, splitting attributes into more specific attributes, creating new attributes, and so on. For details, see About the Transform page user interface.
  5. In Discover, build the components that visualize the data you want to preserve in a more productized way.

To convert a project to a BDD Application:

  1. Optionally, if you created one or more data sets by uploading source data into Studio, change the source of each data set to point to a new Hive table. This step is not required if the data set is based on a table created directly in Hive. (This step is necessary for data encoding reasons.)
    1. In Studio, find the data set's current Hive table name by clicking Show Data Set Properties. The name is of the form default.my_uploaded_data.
    2. Browse to the Hive query editor for your Hadoop distribution. For example, in a Cloudera environment, this is the Hive Query Editor in Hue.
    3. Run the following Hive command and specify the data set's Hive table name:
      SHOW CREATE TABLE `default.my_uploaded_data`
    4. Copy the resulting create table command into a text editor and modify it: point the table definition to the new HDFS location of the data set's source files, and specify the column data types using Hive types (by default, the table description shows all attributes as strings). Also change the table encoding properties, such as Row Format, Storage Format, and SerDe, as necessary to match the source file encoding. Optionally, change other table properties, such as dataSetDisplayName or comment, as needed.
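      For example, a modified definition might look like the following sketch, shown as an external table because that is typical when the source files already exist in HDFS. The column names, data types, HDFS location, and field delimiter here are illustrative only and must be replaced with values that match your own source files:
      CREATE EXTERNAL TABLE default.my_uploaded_data (
        order_id    INT,
        order_date  TIMESTAMP,
        amount      DOUBLE,
        customer    STRING)
      ROW FORMAT DELIMITED
        FIELDS TERMINATED BY ','
      STORED AS TEXTFILE
      LOCATION '/user/bdd/production/my_uploaded_data'
      TBLPROPERTIES ('dataSetDisplayName'='My Uploaded Data')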
    5. Run a drop table command to remove the existing Hive table:
      DROP TABLE `default.my_uploaded_data`
    6. Run the new create table command to re-create the table with the same name and columns but with the new source file location and encoding.
    7. Test that the new table definition is correct by browsing in Hue to Metastore Tables, finding the re-created Hive table, and clicking Sample. Optionally, run a Hive query to count the number of rows in the table and confirm the expected data set size.
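      For example, a row count query for the table used in this procedure might look like this:
      SELECT COUNT(*) FROM default.my_uploaded_data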
    8. Repeat these steps for each data set in the project that has a Data source type value of Excel, Delimited, or JDBC.
    Important: Now that the new Hive table has the same name as the old one, do not use the Actions > Reload Data set feature on the quick look for this specific data set in the Catalog. The Reload Data set feature invokes the file upload wizard, which would allow you to overwrite the production Hive table that you just created.
  2. Optionally, specify a record identifier for the project if you plan to run incremental updates to the data.
  3. Load the full data set into the project.
    This step provides all the source data so that visualization and analysis includes the full record set. For details, see Loading the full data set in a project.
  4. Schedule updates to data sets using the Data Processing CLI and cron jobs.
    For details, see "DP CLI configuration" and "DP CLI cron job" in the Data Processing Guide.
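    As a rough sketch of this step, a crontab entry like the following could run a scheduled refresh of a data set each night. The installation path, CLI option, and log location shown here are assumptions for illustration; the actual DP CLI options and cron setup are described in the Data Processing Guide:
    # Illustrative only: refresh the data set every night at 1:00 AM.
    # Replace the script path and options with the values documented for your installation.
    0 1 * * * /path/to/dataprocessing/data_processing_CLI --refreshData my_uploaded_data >> /var/log/bdd_refresh.log 2>&1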