Creating a data set from a file

You can create a new data set in Studio by uploading personal data from files. After upload, the data is available as a data set in the Catalog.

Studio supports Microsoft Excel files, delimited files such as CSV, TSV, and TXT, and compressed files such as ZIP, GZ, and GZIP. A compressed file may contain only one delimited file.

Microsoft Excel files must have a suffix of XLS or XLSX and may have been created using any of the following versions:
  • Excel 2016
  • Excel 2013
  • Excel 2010
  • Excel 2007
  • Excel 97 - 2003
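
The file-type rules above can be checked locally before you start an upload. The following Python sketch is illustrative only and is not part of Studio: the function name check_upload_candidate is hypothetical, and the suffix sets are assumptions taken from the formats named in this topic (the list of delimited formats may be longer in practice).

```python
import zipfile
from pathlib import Path

# Assumption: suffixes taken from the formats named in this topic.
DELIMITED = {".csv", ".tsv", ".txt"}
EXCEL = {".xls", ".xlsx"}
GZIPPED = {".gz", ".gzip"}


def check_upload_candidate(path):
    """Return (ok, reason) for a file under the rules described above."""
    suffix = Path(path).suffix.lower()
    if suffix in DELIMITED or suffix in EXCEL:
        return True, "supported file type"
    if suffix in GZIPPED:
        # A gzip file decompresses to a single stream, so it is not opened here.
        return True, "gzip file"
    if suffix == ".zip":
        with zipfile.ZipFile(path) as archive:
            members = [m for m in archive.infolist() if not m.is_dir()]
        if len(members) != 1:
            return False, f"ZIP holds {len(members)} files; exactly one is allowed"
        inner = Path(members[0].filename).suffix.lower()
        if inner not in DELIMITED:
            return False, f"ZIP member has unsupported suffix {inner!r}"
        return True, "ZIP archive with one delimited file"
    return False, f"unsupported suffix {suffix!r}"
```

Only ZIP archives are opened, because the single-file rule needs the member list; every other check relies on the file suffix alone.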

To create a data set from a file:

  1. Click the Add Data Set option on the Catalog.

    This option adds the new data set to the Catalog. You can also add a new data set from within a project.

  2. Click Create a data set from a file.
  3. Click Browse, locate the file, and click Open.
  4. Click Next.
  5. On the Preview page, you can edit attributes and limit the data before you upload it.
    1. To exclude an attribute from the data set, deselect its check box.
    2. To modify the name of an attribute as it appears in the data set, select the column header and edit the name of the attribute.
  6. Expand Basic Settings and set the following options:
    • My data includes header row: Specify whether the file includes a header row. If you deselect this option, Studio creates an alphabetized list in place of attribute names.
    • Skip the first 0 rows: If necessary, indicate how many rows to skip from the top of the file.
    • Sheet: Microsoft Excel files only. Select which sheet to load from the list. A data set corresponds to one sheet. Run the wizard again if you need to process multiple sheets.
    • Fields are delimited by: Delimited file formats only. Select a field delimiter from the list. This selection often corresponds to the file format, for example a comma for CSV or a tab for TSV/TAB.
    • Fields are quoted: Delimited file formats only. Specify whether the fields contain quoted values.
    • Quote Character: Delimited file formats only. If you enabled Fields are quoted, select the quote character.
    • Quote escape character: Delimited file formats only. If you enabled Fields are quoted, select the quote escape character.
  7. Expand Advanced Settings and set the following options:
    • Character Encoding: Delimited file formats only. Select the file encoding. If you are unsure, you may have to open the file in a full-featured text editor and use the editor to detect the encoding.
    • Language: Specify the language of text data in your file. This setting is used during data processing and then used for value and keyword searches.
  8. Click Next.
  9. On the Create your data set page:
    1. Specify a name for the data set as it appears in the Catalog.
    2. Optionally, specify a description for the data set.
    3. Optionally, specify a Hive table name. By default, the Hive table name is the same as the data set name. If you create a data set with the same name as an existing data set, you must specify a different Hive table name, because Studio maps each data set name to a unique table name.
  10. Click Create. Then click Add Another Data Set if you have more files to upload, or click Return to Catalog while the file is being processed.
A new data set based on the file is available in the Catalog.
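
If you are unsure which values to choose for Character Encoding, Fields are delimited by, and Quote Character in the steps above, you can inspect the file before you upload it. The following Python sketch shows one possible approach using the standard library; the function name guess_settings and the short list of candidate encodings are assumptions for illustration, and the results are heuristic guesses that should be confirmed on the Preview page.

```python
import codecs
import csv

# Assumption: a short, ordered list of encodings to try; extend as needed.
CANDIDATE_ENCODINGS = ("utf-8", "cp1252")


def guess_settings(path, sample_size=64 * 1024):
    """Guess the encoding, field delimiter, and quote character of a file."""
    with open(path, "rb") as handle:
        raw = handle.read(sample_size)

    for encoding in CANDIDATE_ENCODINGS:
        # An incremental decoder tolerates a multi-byte character that was
        # cut off at the end of the sample.
        decoder = codecs.getincrementaldecoder(encoding)(errors="strict")
        try:
            text = decoder.decode(raw, final=False)
            break
        except UnicodeDecodeError:
            continue
    else:
        raise ValueError("sample did not decode with any candidate encoding")

    # csv.Sniffer applies heuristics; it raises csv.Error when it cannot decide.
    dialect = csv.Sniffer().sniff(text, delimiters=",;\t|")
    return {
        "encoding": encoding,
        "delimiter": dialect.delimiter,
        "quotechar": dialect.quotechar,
    }
```

The encodings are tried in order, so the first one that decodes the sample without error is reported. Single-byte encodings such as cp1252 accept almost any input, which is why they belong at the end of the list.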

If the file upload fails due to a connection timeout, the file may be too large to upload from Studio. To work around this issue, ask your Hive database administrator to import the source file into a Hive table, and then run the Data Processing CLI utility to process the table. After data processing, a new data set based on the file is available in the Catalog.
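
As a starting point for the Hive import in this workaround, the following Python sketch drafts a HiveQL CREATE TABLE statement from the header row of a delimited file. It is an illustration under stated assumptions, not the procedure your administrator must follow: the helper names hive_identifier and build_create_table are hypothetical, every column is typed as STRING, the file is assumed to be UTF-8, and ROW FORMAT DELIMITED does not interpret quoted fields (a CSV SerDe may be a better fit for quoted data). Data Processing CLI options are not shown here.

```python
import csv
import re


def hive_identifier(name):
    """Reduce a free-form name to lowercase letters, digits, and underscores."""
    cleaned = re.sub(r"[^0-9a-z]+", "_", name.strip().lower()).strip("_")
    return cleaned or "unnamed"


def build_create_table(csv_path, table_name, delimiter=","):
    """Draft a CREATE TABLE statement with one STRING column per header field."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        header = next(csv.reader(handle, delimiter=delimiter))
    columns = ",\n".join(f"  `{hive_identifier(col)}` STRING" for col in header)
    # Write a tab as the \t escape so the statement stays readable.
    terminator = "\\t" if delimiter == "\t" else delimiter
    return (
        f"CREATE TABLE `{hive_identifier(table_name)}` (\n{columns}\n)\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{terminator}'\n"
        "STORED AS TEXTFILE\n"
        # Skip the header line when Hive reads the file.
        "TBLPROPERTIES ('skip.header.line.count'='1');"
    )
```

The generated statement only defines the table; loading the file into it and running data processing remain separate steps.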