You can create a new data set in the Catalog by uploading personal data files into Studio, or by importing and filtering a JDBC data source.
Permissions
A Studio user must have the User, Power User, or Administrator role to load data from a file. Creating a data set from a JDBC data source additionally requires the user name and password of a user with the appropriate database credentials.
Maximum sample size for file upload
By default, any file you upload is processed into a sample of at most 1,000,000 records. Files that contain more than 1,000,000 records are down-sampled to approximately 1,000,000 records, while files with fewer than 1,000,000 records contribute their full record set to the resulting data set.
If necessary, you can increase the default of 1,000,000 records to another value. For details, see the bdd.sampleSize documentation in the Administrator's Guide.
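Studio's sampling implementation is internal, but as a rough illustration of what down-sampling an arbitrarily large file to a fixed cap can look like, the following Python sketch uses reservoir sampling. The function name, cap, and seed are illustrative assumptions, not part of Studio.

```python
import random

def downsample(records, cap=1_000_000, seed=42):
    """Keep at most `cap` records from an iterable of any length.

    Inputs with `cap` or fewer records pass through unchanged; larger
    inputs yield an approximately uniform random sample of size `cap`.
    """
    rng = random.Random(seed)
    reservoir = []
    for i, record in enumerate(records):
        if i < cap:
            reservoir.append(record)
        else:
            # Replace a kept record with decreasing probability so that
            # every record has an equal chance of staying in the sample.
            j = rng.randrange(i + 1)
            if j < cap:
                reservoir[j] = record
    return reservoir
```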
Potential upload timeouts
If a file upload fails due to a connection timeout, the file may be too large to upload from Studio. To work around this issue, ask your Hive database administrator to import the source file into a Hive table, and then run the Data Processing CLI utility to process that table. After data processing completes, a new data set based on the file is available in the Catalog.
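As a minimal sketch of the second step, the following Python wrapper invokes the Data Processing CLI against the administrator-created Hive table. The installation path, flag names, and table name here are assumptions; check the Data Processing CLI documentation for the actual invocation on your cluster.

```python
import subprocess

# Assumed install location of the DP CLI script; yours will differ.
DP_CLI = "/opt/bdd/dataprocessing/data_processing_CLI"

def process_hive_table(database: str, table: str) -> None:
    """Run the DP CLI against a Hive table so that a data set for it
    appears in the Catalog. --database/--table are assumed flag names."""
    subprocess.run([DP_CLI, "--database", database, "--table", table],
                   check=True)

process_hive_table("default", "my_large_source_file")
```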
Duplicate columns and multi-value attributes
If your personal data file has columns with identical headings, those columns are merged into a single multi-value attribute in the new data set. For example, if your data includes:
| Item | Color | Color | Color |
|---|---|---|---|
| T-Shirt | Red | Blue | Green |
| Sweatshirt | Red | White | |
Then in the final data set, the result is:
| Item | Color |
|---|---|
| T-Shirt | Red, Blue, Green |
| Sweatshirt | Red, White |
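Studio performs this merge automatically during upload. As a conceptual sketch only (not Studio's implementation), the following Python collapses duplicate-named columns of a delimited file into multi-value fields; the function name and file layout are assumptions.

```python
import csv
from collections import defaultdict

def merge_duplicate_columns(path):
    """Collapse columns that share a heading into one multi-value field."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        headers = next(reader)          # e.g. Item, Color, Color, Color
        rows = []
        for row in reader:
            merged = defaultdict(list)
            for header, value in zip(headers, row):
                if value:               # ignore empty trailing cells
                    merged[header].append(value)
            rows.append(dict(merged))
    return rows

# For the example above, the second data row becomes:
# {'Item': ['Sweatshirt'], 'Color': ['Red', 'White']}
```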
Anti-virus and malware
Studio can load Excel spreadsheets and delimited files. Oracle strongly encourages you to scan files with anti-virus products before uploading them into Studio. During upload, Studio converts these files into the Hadoop Avro format, uploads the data to HDFS, and then registers a Hive table for the data; the original file is then discarded.
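As a conceptual illustration of the conversion step only (Studio's own pipeline, the HDFS upload, and the Hive registration are internal and not shown), the following sketch writes rows of a delimited file as Avro records using the third-party fastavro library. The schema and column names are assumptions based on the example above.

```python
import csv
from fastavro import parse_schema, writer  # pip install fastavro

# Assumed schema for a two-column file; Studio derives the real
# schema from the uploaded file itself.
SCHEMA = parse_schema({
    "name": "upload",
    "type": "record",
    "fields": [
        {"name": "item", "type": "string"},
        {"name": "color", "type": "string"},
    ],
})

def csv_to_avro(src_path, dest_path):
    """Write each data row of a delimited file as one Avro record."""
    with open(src_path, newline="") as src, open(dest_path, "wb") as dest:
        reader = csv.reader(src)
        next(reader)  # skip the header row
        records = ({"item": r[0], "color": r[1]} for r in reader)
        writer(dest, SCHEMA, records)
```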
Parent topic: Managing Data Sets