You can create a new data set in the Catalog by uploading personal data files into Studio, or by importing and filtering a JDBC data source.
Permissions
A Studio user must have the User, Power User, or Administrator role to load data from a file. Creating a data set from a JDBC data source additionally requires the user name and password of a user with the appropriate database credentials.
Maximum sample size for file upload
By default, any file you upload is processed into a sample of at most 1,000,000 records. Files that contain more than 1,000,000 records are down-sampled to approximately 1,000,000 records, while files with fewer than 1,000,000 records contribute their full record set to the resulting data set.
If necessary, you can increase the default of 1,000,000 records to another value. For details, see the bdd.sampleSize documentation in the Administrator's Guide.
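Studio's sampling implementation is internal, but as a rough illustration of what down-sampling an arbitrarily large file to a fixed cap can look like, the following Python sketch uses reservoir sampling. The function name, cap, and seed are illustrative assumptions, not part of Studio.

```python
import random

def downsample(records, cap=1_000_000, seed=42):
    """Keep at most `cap` records from an iterable of any length.

    Inputs with `cap` or fewer records pass through unchanged; larger
    inputs yield an approximately uniform random sample of size `cap`.
    """
    rng = random.Random(seed)
    reservoir = []
    for i, record in enumerate(records):
        if i < cap:
            reservoir.append(record)
        else:
            # Replace a kept record with decreasing probability so that
            # every record has an equal chance of staying in the sample.
            j = rng.randrange(i + 1)
            if j < cap:
                reservoir[j] = record
    return reservoir
```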
Potential upload timeouts
If a file upload fails due to a connection timeout, the file may be too large to upload from Studio. To work around this issue, ask your Hive database administrator to import the source file into a Hive table, and then run the Data Processing CLI utility to process that table. After data processing completes, a new data set based on the file is available in the Catalog.
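As a minimal sketch of the second step, the following Python wrapper invokes the Data Processing CLI against the administrator-created Hive table. The installation path, flag names, and table name here are assumptions; check the Data Processing CLI documentation for the actual invocation on your cluster.

```python
import subprocess

# Assumed install location of the DP CLI script; yours will differ.
DP_CLI = "/opt/bdd/dataprocessing/data_processing_CLI"

def process_hive_table(database: str, table: str) -> None:
    """Run the DP CLI against a Hive table so that a data set for it
    appears in the Catalog. --database/--table are assumed flag names."""
    subprocess.run([DP_CLI, "--database", database, "--table", table],
                   check=True)

process_hive_table("default", "my_large_source_file")
```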
Duplicate columns and multi-value attributes
If your personal data file has columns with identical headings, those columns are merged into a single multi-value attribute in the new data set. For example, if your data includes:
| Item | Color | Color | Color |
|---|---|---|---|
| T-Shirt | Red | Blue | Green |
| Sweatshirt | Red | White | |
Then in the final data set, the result is:
| Item | Color |
|---|---|
| T-Shirt | Red, Blue, Green |
| Sweatshirt | Red, White |
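Studio performs this merge automatically during upload. As a conceptual sketch only (not Studio's implementation), the following Python collapses duplicate-named columns of a delimited file into multi-value fields; the function name and file layout are assumptions.

```python
import csv
from collections import defaultdict

def merge_duplicate_columns(path):
    """Collapse columns that share a heading into one multi-value field."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        headers = next(reader)          # e.g. Item, Color, Color, Color
        rows = []
        for row in reader:
            merged = defaultdict(list)
            for header, value in zip(headers, row):
                if value:               # ignore empty trailing cells
                    merged[header].append(value)
            rows.append(dict(merged))
    return rows

# For the example above, the second data row becomes:
# {'Item': ['Sweatshirt'], 'Color': ['Red', 'White']}
```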
Anti-virus and malware
Studio can load Excel spreadsheets and delimited files. Oracle strongly encourages you to scan files with anti-virus products before uploading them into Studio. During upload, Studio converts these files into the Hadoop Avro format, uploads the data to HDFS, and then registers a Hive table for the data; the original file is then discarded.
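As a conceptual illustration of the conversion step only (Studio's own pipeline, the HDFS upload, and the Hive registration are internal and not shown), the following sketch writes rows of a delimited file as Avro records using the third-party fastavro library. The schema and column names are assumptions based on the example above.

```python
import csv
from fastavro import parse_schema, writer  # pip install fastavro

# Assumed schema for a two-column file; Studio derives the real
# schema from the uploaded file itself.
SCHEMA = parse_schema({
    "name": "upload",
    "type": "record",
    "fields": [
        {"name": "item", "type": "string"},
        {"name": "color", "type": "string"},
    ],
})

def csv_to_avro(src_path, dest_path):
    """Write each data row of a delimited file as one Avro record."""
    with open(src_path, newline="") as src, open(dest_path, "wb") as dest:
        reader = csv.reader(src)
        next(reader)  # skip the header row
        records = ({"item": r[0], "color": r[1]} for r in reader)
        writer(dest, SCHEMA, records)
```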
Parent topic: Managing Data Sets