Loading Data

You can load datasets into ADS, either locally or from network file systems.

You can open datasets with DatasetFactory or DatasetBrowser. DatasetFactory loads datasets from your own storage into ADS.

DatasetBrowser opens datasets from websites and reference libraries, such as scikit-learn, directly into ADS.

When you open a dataset in DatasetFactory, you can get the summary statistics, correlations, and visualizations of the dataset.

ADS Supports:

Data Sources

Oracle Cloud Infrastructure Object Storage

Oracle Database with cx_Oracle

Autonomous Databases: ADW and ATP

Hadoop Distributed File System

Amazon S3

Google Cloud Service

Microsoft Azure



NoSQL DB instances

Elasticsearch instances

HTTP and HTTPS Sources

Your local files

Data Formats

pandas.DataFrame, Dask.DataFrame

Array, Dictionary

Comma Separated Values (CSV)

Tab Separated Values (TSV)


Javascript Object Notation (JSON)


xls, xlsx (Excel)


Hierarchical Data Format 5 (HDF5)

Apache server log files



Attribute-Relation File Format (ARFF)

Data Types

Text Types (str)

Numeric Types (int, float)

Boolean Types (bool)

ADS Does Not Support:

Data Sources

Data that you don’t have permission to access.

Data Formats

Text Files



Raw Images


Data Types

Sequence Types (list, tuple, range)

Mapping Types (dict)

Set Types (set)

Connecting to Data Sources

You can load data into ADS in several different ways from Oracle Cloud Infrastructure Object Storage, cx_Oracle, or S3. Following are some examples.

Begin by loading the necessary libraries and modules:

import pandas as pd
import numpy as np

from ads.dataset.factory import DatasetFactory
from ads.dataset.dataset_browser import DatasetBrowser

import ads

Oracle Cloud Infrastructure Object Storage

To open a dataset from Object Storage using the resource principal method, you can use the following example, replacing the angle bracketed content with the location and name of your file:

import os

import ads
from ads.dataset.factory import DatasetFactory

bucket_name = "<bucket-name>"
file_name = "<file-name>"
storage_options = {'config': {'tenancy': os.environ['TENANCY_OCID'], 'region': os.environ['NB_REGION']}}
ds = DatasetFactory.open(f"oci://{bucket_name}/{file_name}", storage_options=storage_options)

To open a dataset from Object Storage using the Oracle Cloud Infrastructure configuration file method, include the location of the file using the format oci://<bucket-name>/<file-name>, and set the optional storage_options parameter to your configuration file and profile. For example:

ds = DatasetFactory.open("oci://<bucket-name>/<file-name>", storage_options = {
   "config": "~/.oci/config",
   "profile": "DEFAULT_USER"
})

Local Storage

To open a dataset from a local source, use DatasetFactory.open and specify the path of the data file:

ds = DatasetFactory.open("/path/to/data.data", format='csv', delimiter=" ")

Oracle Database with cx_Oracle


To connect to Oracle Databases from Python, you use the cx_Oracle package that conforms to the Python database API specification.

You must have the client credentials and connection information to connect to the database. The client credentials include the wallet, which is required for all types of connections. Use these steps to work with ADB and wallet files:

  1. From the Console, go to the Oracle Cloud Infrastructure ADW or ATP instance page that you want to load the dataset from, and then click DB Connection.

  2. Click Download Wallet.

  3. Enter a password when prompted. This password protects the wallet and is used for some ADB connection types, but not for the connections used in this notebook.

  4. Create a folder for your wallet in the notebook environment (<path_to_wallet_folder>).

  5. Upload your wallet files into <path_to_wallet_folder> folder using the Jupyterlab Upload Files button.

  6. Open the sqlnet.ora file from the wallet files, and then configure the METHOD_DATA to be: METHOD_DATA = (DIRECTORY="<path_to_wallet_folder>")

  7. Set the environment variable TNS_ADMIN to point to the wallet you want to use.
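After step 6, the sqlnet.ora file might look like the following. The wallet directory shown here is illustrative; use the folder you created in step 4:

```
WALLET_LOCATION = (SOURCE = (METHOD = file) (METHOD_DATA = (DIRECTORY="/home/datascience/wallet")))
SSL_SERVER_DN_MATCH = yes
```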

In this example, a Python dictionary, creds, is used to store the credentials. However, it is poor security practice to store this information in a notebook. The notebook ads-examples/ADB_working_with.ipynb gives an example of how to store them in Block Storage.

creds = {}
creds['tns_admin'] = '<path_to_wallet_folder>'
creds['sid'] = '<your SID>'
creds['user'] = '<database username>'
creds['password'] = '<database password>'

Once your Oracle client is set up, you can use cx_Oracle directly with pandas, as in this example:

import pandas as pd
import cx_Oracle
import os

os.environ['TNS_ADMIN'] = creds['tns_admin']
with cx_Oracle.connect(creds['user'], creds['password'], creds['sid']) as ora_conn:
  df = pd.read_sql('''
    SELECT ename, dname, job, empno, hiredate, loc
    FROM emp, dept
    WHERE emp.deptno = dept.deptno
    ORDER BY ename
  ''', con=ora_conn)

You can also use cx_Oracle within ADS by creating a connection string:

os.environ['TNS_ADMIN'] = creds['tns_admin']
from ads.dataset.factory import DatasetFactory
uri = 'oracle+cx_oracle://' + creds['user'] + ':' + creds['password'] + '@' + creds['sid']
ds = DatasetFactory.open(uri, format="sql", table="<table_name>", index_col="<index_column>")


Autonomous Database

Oracle has two configurations of Autonomous Databases. They are the Autonomous Data Warehouse (ADW) and the Autonomous Transaction Processing (ATP) database. Both are fully autonomous databases that scale elastically, deliver fast query performance, and require minimal database administration.


To access ADW, review Setup for ADB in Configuration. It shows you how to get the client credentials (wallet) and set up the proper environment variable.

After the notebook environment has been configured to access ADW, you can use ADS to load and query data, as in the following examples.

Loading Data from ADB

After you have stored the ADB username, password, and database name (SID) as variables, you can build the URI to use as your connection source.

uri = 'oracle+cx_oracle://' + creds['user'] + ':' + creds['password'] + '@' + creds['sid']

You can use ADS to query a table from your database, and then load that table as an ADSDataset object through DatasetFactory. When you open DatasetFactory, specify the name of the table you want to pull using the table variable. The table parameter also accepts SQL expressions, for example, table="SELECT * FROM sh.times WHERE rownum <= 30".

os.environ['TNS_ADMIN'] = creds['tns_admin']
ds = DatasetFactory.open(uri, format="sql", table=table, target='label')

Querying Data from ADB

  • Query using pandas

    This example shows you how to query data using pandas and sqlalchemy to read data from ADB:

    import os

    import pandas as pd
    from sqlalchemy import create_engine

    os.environ['TNS_ADMIN'] = creds['tns_admin']
    engine = create_engine(uri)
    df = pd.read_sql('SELECT * from <TABLENAME>', con=engine)

    You can convert the pd.DataFrame into an ADSDataset using the DatasetFactory.from_dataframe() function.

    ds = DatasetFactory.from_dataframe(df)

    These two examples run a simple query on ADW data. With read_sql_query, you can use SQL expressions not only to select tables, but also to limit the number of rows and to apply conditions with filters, such as WHERE clauses.

    ds = pd.read_sql_query('SELECT * from <TABLENAME>', uri)
    ds = pd.read_sql_query('SELECT * FROM emp WHERE ROWNUM <= 5', uri)
  • Query using cx_Oracle

    You can also query data from ADW using cx_Oracle. Use cx_Oracle version 7.0.0 with ADS. Ensure that you change the dummy <TABLENAME> placeholder to the actual table name you want to query data from, and the dummy <COLNAME> placeholder to the column name that you want to select:

    import cx_Oracle
    import pandas as pd
    import numpy as np
    import os
    os.environ['TNS_ADMIN'] = creds['tns_admin']
    connection = cx_Oracle.connect(creds['user'], creds['password'], creds['sid'])
    cursor = connection.cursor()
    results = cursor.execute("SELECT * from <TABLENAME>")
    data = results.fetchall()
    df = pd.DataFrame(np.array(data))
    ds = DatasetFactory.from_dataframe(df)
    results = cursor.execute('SELECT <COLNAME> from <TABLENAME>').fetchall()

Don’t forget to close the cursor and connection using the close method:

cursor.close()
connection.close()

Training Models with ADB

After you load your data from ADB, the ADSDataset object is created, which allows you to build models using AutoML.

from ads.automl.driver import AutoML
from ads.automl.provider import OracleAutoMLProvider

train, test = ds.train_test_split()
model, baseline = AutoML(train, provider=OracleAutoMLProvider()).train(model_list=["LGBMClassifier"])

Updating ADB Tables with Model Predictions

To add predictions to a table, you can either update an existing table, or create a new table with the added predictions. There are many ways to do this. One way is to use the model to update a CSV file, and then use Oracle SQL*Loader or SQL*Plus.
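As a sketch of the CSV route, predictions can be appended to each row and written out with the standard library. The column names, file name, and predict() function here are hypothetical stand-ins for your model and schema:

```python
import csv

# Hypothetical feature rows; predict() stands in for a trained model's predict method
rows = [[5.1, 3.5, 1.4, 0.2], [6.7, 3.0, 5.2, 2.3]]

def predict(row):
    # Toy rule standing in for a real classifier
    return 'setosa' if row[2] < 3 else 'virginica'

# Write features plus the prediction column to a CSV for SQL*Loader or SQL*Plus
with open('iris_predicted.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'yhat'])
    for row in rows:
        writer.writerow(row + [predict(row)])
```

The resulting file can then be bulk-loaded into the database table outside of the notebook.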

This example adds predictions programmatically using cx_Oracle. It uses executemany to insert rows as tuples created using the model’s predict method:

ds = DatasetFactory.open("iris.csv")

# ds_res is assumed to hold the feature columns plus the model's predictions (yhat)
create_table = '''CREATE TABLE IRIS_PREDICTED (
                        sepal_length number,
                        sepal_width number,
                        petal_length number,
                        petal_width number,
                        SPECIES VARCHAR2(20),
                        yhat VARCHAR2(20)
                  )'''

connection = cx_Oracle.connect(creds['user'], creds['password'], creds['sid'])
cursor = connection.cursor()
cursor.execute(create_table)

rows = [tuple(x) for x in ds_res.values]

cursor.executemany("""
  insert into IRIS_PREDICTED
    (sepal_length, sepal_width, petal_length, petal_width, SPECIES, yhat)
  values (:1, :2, :3, :4, :5, :6)""",
  rows)
connection.commit()


For some models, you could also use predict_proba to get an array of predictions and their confidence probability.

Amazon S3

You can open Amazon S3 public or private files in ADS. For private files, you must pass the right credentials through the ADS storage_options dictionary. If you have large S3 files, then you benefit from an increased blocksize.

ds = DatasetFactory.open("s3://bucket_name/iris.csv", storage_options = {
    'key': 'aws key',
    'secret': 'aws secret',
    'blocksize': 1000000,
    'client_kwargs': {
            "endpoint_url": "https://s3-us-west-1.amazonaws.com"
    }
})

HTTP and HTTPS Sources

To open a dataset from a remote web server source, use DatasetFactory.open() and specify the URL of the data:

ds = DatasetFactory.open('https://example.com/path/to/data.csv', target='label')

Loading Data from DatasetBrowser

DatasetBrowser allows easy access to datasets from reference libraries and index websites, such as scikit-learn. To see the supported libraries, use the list() function:

DatasetBrowser.list()
['web', 'sklearn', 'seaborn', 'R']

To see which datasets are available from scikit-learn, use:

sklearn = DatasetBrowser.sklearn()
sklearn.list()
['boston', 'breast_cancer', 'diabetes', 'iris', 'wine', 'digits']

Datasets are provided as a convenience. Datasets are considered Third Party Content and are not considered Materials under Your agreement with Oracle applicable to the Services. Review the dataset license.

To explore one of the datasets, use open() specifying the name of the dataset:

ds = sklearn.open('wine')

Loading Datasets of Various Format Types

You can load data with different formats into DatasetFactory. Following are some examples.


pandas.DataFrame

You can pass the pandas.DataFrame object directly into the ADS DatasetFactory.open method:

import pandas as pd
from ads.dataset.factory import DatasetFactory

df = pd.read_csv('/path/some_data.csv') # load data with Pandas

# use open...

ds = DatasetFactory.open(df) # construct **ADS** Dataset from DataFrame

# alternative form...

ds = DatasetFactory.from_dataframe(df)

# an example using pandas to parse data on the clipboard as a CSV and construct an ADS Dataset object
# this allows easily transferring data from an application like Microsoft Excel, Apple Numbers, etc.

ds = DatasetFactory.from_dataframe(pd.read_clipboard())

# use pandas to query a SQL database:

from sqlalchemy import create_engine
engine = create_engine('dialect://user:pass@host:port/schema', echo=False)
df = pd.read_sql_query('SELECT * FROM mytable', engine, index_col = 'ID')
ds = DatasetFactory.from_dataframe(df)

You can also use a Dask.DataFrame in the same way.


Array

You can convert an array into a pandas DataFrame and then open it with DatasetFactory:

generated_data_arr = [["ID", "Name", "GPA"], [1, "Bob", 3.7], [2, "Sam", 4.3], [3, "Erin", 2.6]]
generated_df1 = pd.DataFrame(generated_data_arr[1:], columns=generated_data_arr[0])
generated_ds1 = DatasetFactory.open(generated_df1)


Dictionary

You can convert a dictionary into a pandas DataFrame and then open it with DatasetFactory:

generated_data_dict = {"ID": [1.1, 2.0, 3.0],
                     "Name": ["Bob", "Sam", "Erin"],
                     "GPA": [3.7, 4.3, 2.6]}
generated_df2 = pd.DataFrame(generated_data_dict)
generated_ds2 = DatasetFactory.open(generated_df2)


CSV

You can load a CSV file into DatasetFactory using open():

ds = DatasetFactory.open("data/multiclass_fk_10k.csv")


If your dataset does not include a header, then DatasetFactory assumes the first row is a header and names each feature after the corresponding value in that row. This naming is usually undesirable and can lead to subtle bugs. In particular, many CSVs use spaces for readability, which can cause trouble when you try to set your target variable within DatasetFactory.open().

The workaround is to pass header=None to DatasetFactory:

ds = DatasetFactory.open("sample_data.csv", header=None)

All of your columns are given integer names beginning with 0.
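A minimal pandas illustration of this behavior, using an in-memory CSV:

```python
import io

import pandas as pd

# Two data rows, no header line
raw = "5.1,3.5,setosa\n6.7,3.0,virginica\n"

# Without header=None, the first data row is consumed as column names
df_bad = pd.read_csv(io.StringIO(raw))

# With header=None, pandas assigns integer column names starting at 0
df_ok = pd.read_csv(io.StringIO(raw), header=None)
print(list(df_ok.columns))  # [0, 1, 2]
```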


TSV

You can open a tsv file, or a file with any arbitrary separation key, with DatasetFactory using open(). This example generates a tsv file, and then opens it with DatasetFactory:

f = open("tmp_random_ds99.tsv","w+")
f.write('1 \t 2 \t 3 \t 4 \t 5 \t 6 \n 1.1 \t 2.1 \t 3.1 \t 4.1 \t 5.1 \t 6.1')
f.close()

ds = DatasetFactory.open("tmp_random_ds99.tsv", column_names=['a','b','c','d','e','f'])

Delimited Files

CSV and TSV are the most common delimited files. However, files can use other delimiters. To read them with the DatasetFactory.open() method, the delimiter parameter must be given the delimiting value. DatasetFactory.open() treats all delimited files as CSV, so the format='csv' parameter must also be specified even though the delimiter is not a comma or a tab. DatasetFactory.open() attempts to determine the column names from the first line of the file. Alternatively, the column_names option can be used to specify them.

In this example, a file is created that is delimited with a vertical bar (|), and then read in with the DatasetFactory.open() method.

import tempfile

# Create a delimited file with a '|' as a separator
file = tempfile.NamedTemporaryFile()
for i in range(5):
    for j in range(7):
        term = '|' if j != 6 else '\n'
        file.write(bytes('{}.{}'.format(i, j) + term, 'utf-8'))
file.flush()

# Print the raw file
file.seek(0)
for line in file:
    print(line)

# Read in the delimited file and specify the column names.
ds = DatasetFactory.open(file.name, delimiter='|', format='csv', column_names=['a','b','c','d','e','f'])


JSON

JSON files are supported by DatasetFactory.open() as long as the data can be restructured into a rectangular form. There are two supported formats of JSON, called orientations. The orientation is given by orient=index or orient=records.

For the index orientation, there is a single JSON object. The format is:

    {
        <index>: <value>,
        <index>: <value>
    }

For example:

    {
        "946684800000": {"id": 982, "name": "Yvonne", "x": -0.3289461521, "y": -0.4301831275},
        "946684801000": {"id": 1031, "name": "Charlie", "x": 0.9002882524, "y": -0.2144513329}
    }

For the records format, there is a collection of JSON objects. No index value is given and there is no comma between records. The format is:

{<key>: <value>, <key>: <value>}
{<key>: <value>, <key>: <value>}

For example:

{"id": 982, "name": "Yvonne", "x": -0.3289461521, "y": -0.4301831275}
{"id": 1031, "name": "Charlie", "x": 0.9002882524, "y": -0.2144513329}
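The records orientation is line-delimited, so each record parses independently. A stdlib-only sketch of reading it, using the fields from the example above:

```python
import json

# Two records in the line-delimited "records" orientation
raw = (
    '{"id": 982, "name": "Yvonne", "x": -0.3289461521, "y": -0.4301831275}\n'
    '{"id": 1031, "name": "Charlie", "x": 0.9002882524, "y": -0.2144513329}\n'
)

# Parse each non-empty line as its own JSON object
records = [json.loads(line) for line in raw.splitlines() if line.strip()]
print([r['name'] for r in records])  # ['Yvonne', 'Charlie']
```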

In this example, a JSON file is created then read back in with DatasetFactory.open(). If the file extension ends in .json, then the method loads it as a JSON file. If this is not the case, then set format=json.

import tempfile
from os import path

# Create the JSON file that is to be read
[file] = DatasetBrowser.sklearn().open('wine').to_json(path.join(tempfile.mkdtemp(), "wine.json"),
                                                       orient='records', npartitions=1)

# Read in the JSON file
ds = DatasetFactory.open(file, format='json', orient='records')

Excel xls and xlsx

Data scientists often have to work with Excel files as a data source. If the file extension is .xlsx, then DatasetFactory.open() automatically processes it as an Excel file. If not, format='xlsx' can be used. By default, the first sheet in the file is read in. This behavior can be modified with the sheetname parameter. It accepts the sheet number (zero-indexed) or a string with the name of the sheet. DatasetFactory.open() reads in all columns that have values. This behavior can be modified with the usecols parameter. It accepts a list of column numbers to be read in, such as usecols=[1, 3, 5], or it can accept a range as a string, such as usecols="A:C".

# Create the Excel file to read in. Put the data on a sheet called 'wine'
file = tempfile.NamedTemporaryFile()
writer = pd.ExcelWriter(file.name, engine='xlsxwriter')
DatasetBrowser.sklearn().open('wine').to_pandas_dataframe().to_excel(writer, sheet_name='wine')
writer.save()

# Read in the Excel file
ds = DatasetFactory.open(file.name, format='xlsx', sheetname='wine', usecols="A:C")


HDF5

You can load an HDF5 file into DatasetFactory. This example builds an HDF5 file, and then opens it with DatasetFactory:

[ds_loc] = ds.to_hdf("tmp_random_ds99.h5", key='df')
ds_copy = DatasetFactory.open(ds_loc, key='df')


Parquet

You can read Parquet files in ADS. This example writes a dataset out as a Parquet folder, and then opens it with DatasetFactory:

ds.to_parquet("tmp_random_ds99")
ds_copy = DatasetFactory.open("tmp_random_ds99", format='parquet')


ARFF

You can load an ARFF file into DatasetFactory. The file format is recognized from the file name. You can load the file from the internet:

ds = DatasetFactory.open('https://example.com/path/to/some_data.arff')

Specify Data Types in Load Dataset

When you open a dataset, ADS detects data types in the dataset. The ADS semantic dtypes assigned to features in a dataset can be:

  • categorical

  • continuous

  • datetime

  • ordinal

ADS semantic dtypes are based on ADS low-level dtypes. They match with the pandas dtypes ‘object’, ‘int64’, ‘float64’, ‘datetime64’, ‘category’, and so on. When you use an open() statement for a dataset, ADS detects both its semantic and low-level data types. This example specifies the low-level data type, and then ADS detects its semantic type:

import pandas as pd
from ads.dataset.factory import DatasetFactory

df = pd.DataFrame({
        'numbers': [5.0, 6.0, 8.0, 5.0],
        'years': [2007, 2008, 2008, 2009],
        'target': [1, 2, 3, 3]
})

ds = DatasetFactory.open(
        df,
        target = 'numbers',
        types = {'numbers': 'int64'}
)
You can inspect low level and semantic ADS dtypes with the feature_types property:

# print out detailed information on each column
ds.feature_types

# print out ADS "semantic" dtype of a column
ds.feature_types['numbers']['type']

# print out ADS "low-level" dtype of a column
ds.feature_types['numbers']['low_level_type']

You can also get the summary information on a dataset, including its feature details, in a notebook output cell with show_in_notebook:

ds.show_in_notebook()

Use numpy.dtype or pandas dtypes in types parameter to specify your data type. When you update a type, ADS changes both the semantic and the low-level types.

You can either specify a semantic or a low-level data type for types. This example shows how to load a dataset with various types of data:

ds = DatasetFactory.open(
        df,
        target = 'years',
        types = {'years': 'datetime'}
)

ds = DatasetFactory.open(
        df,
        target = 'target',
        types = {'target': 'categorical'}
)

You can find more examples about how to change column data types in Changing Data Types of Columns.