Using Oracle RDBMS with Pandas in OCI Data Science Notebooks

When using the Oracle RDBMS with Python, the most common representation of tabular data is a Pandas dataframe. Once data is in a dataframe, you can perform many operations, from visualization to persisting it in a variety of formats.

Reading data from the Oracle RDBMS to a Pandas dataframe

The Pandas read_sql(...) function is a general, database-independent approach that uses the SQLAlchemy Object Relational Mapper (ORM) to arbitrate between specific database types and Pandas.

It reads a SQL query or a database table into a dataframe.

This function is a convenience wrapper around read_sql_table and read_sql_query (for backward compatibility). It delegates to the specific function depending on the provided input. A SQL query is routed to read_sql_query, while a database table name is routed to read_sql_table.
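For reference, a minimal sketch of that generic approach might look like the following. It assumes SQLAlchemy 2.0+ with the python-oracledb driver; the connect string is a placeholder, not a value from this document:

import pandas as pd
from sqlalchemy import create_engine

# placeholder connect string; substitute real credentials and DSN
engine = create_engine("oracle+oracledb://<username>:<password>@<dsn>")

# database-independent read that goes through the SQLAlchemy layer
df = pd.read_sql("SELECT * FROM SH.SALES", con=engine)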

ADS (2.3.1+), found in the “Data Exploration and Manipulation for CPU V2” conda environment, recommends using the ADS-provided drop-in alternative. It can be up to 15 times faster than pandas.read_sql() because it bypasses the ORM and is written specifically for the Oracle RDBMS.

Use the Pandas ADS accessor drop-in replacement, pd.DataFrame.ads.read_sql(...), instead of pd.read_sql(...).

Examples of querying data from Oracle RDBMS

import pandas as pd
import ads

connection_parameters = {
    "user_name": "<username>",
    "password": "<password>",
    "service_name": "<service_name_{high|med|low}>",
    "wallet_location": "/full/path/to/my_wallet.zip",
}

# simple read of a SQL query into a dataframe with no bind variables
df = pd.DataFrame.ads.read_sql(
    "SELECT * FROM SH.SALES",
    connection_parameters=connection_parameters,
)

# read of a SQL query into a dataframe with a bind variable. Use bind variables
# rather than string substitution to avoid the SQL injection attack vector.
df = pd.DataFrame.ads.read_sql(
    """
    SELECT
    *
    FROM
    SH.SALES
    WHERE
        ROWNUM <= :max_rows
    """,
    bind_variables={
        "max_rows": 100
    },
    connection_parameters=connection_parameters,
)

Performance

The performance is limited by three things:

  • Generational latency: How long the database takes to return rows. Using indexes and writing efficient SQL mitigates this bottleneck (see the sketch after this list).

  • Network saturation: Once the network is saturated, data can’t be delivered between the database and notebook environment any faster. OCI networking is very fast and this isn’t usually a concern. One exception is when the network path goes over VPN or other more complex routing topologies.

  • CPU latency in the notebook: Python has to collect the byte stream delivered by the database into Python data types before promoting them to NumPy objects for Pandas. Additionally, there is a cryptographic CPU overhead because the data in transit is secured with public key infrastructure (PKI).
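To illustrate the first point, here is a minimal sketch that pushes filtering and aggregation into the SQL so that only the aggregated result crosses the network. It reuses connection_parameters from the earlier example and assumes the standard SH sample schema columns PROD_ID and AMOUNT_SOLD:

import pandas as pd

# Push the filter and aggregation into the database so that only one row
# per product comes back, instead of every row of SH.SALES.
df = pd.DataFrame.ads.read_sql(
    """
    SELECT
        PROD_ID,
        SUM(AMOUNT_SOLD) AS TOTAL_SOLD
    FROM
        SH.SALES
    WHERE
        AMOUNT_SOLD > :min_amount
    GROUP BY
        PROD_ID
    """,
    bind_variables={"min_amount": 100},
    connection_parameters=connection_parameters,
)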

Large result sets

If a database query returns more rows than the client's memory permits, you have a couple of easy options. The simplest is to use a larger client shape; larger shapes come with more RAM and increased compute performance. If that's not an option, you can use the pd.DataFrame.ads.read_sql mixin in chunk mode, where the result is no longer a Pandas dataframe but an iterator over a sequence of dataframes. You could use this to read a large data set and write it to Object Storage or a local file system, as in the following example:

for i, df in enumerate(pd.DataFrame.ads.read_sql(
        "SELECT * FROM SH.SALES",
        chunksize=100000,  # rows per chunk
        connection_parameters=connection_parameters,
    )):
    # each df contains up to 100000 rows (chunksize)
    # to write the data to Object Storage, use a URI such as "oci://<bucket>@<namespace>/part_{i}.csv"
    df.to_csv(f"part_{i}.csv")
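If the target is Object Storage rather than the local block volume, a variant of the same loop could stream each chunk straight to a bucket. This sketch assumes the ocifs package is available in the conda environment and an API-key config at ~/.oci/config; the bucket and namespace names are placeholders:

for i, df in enumerate(pd.DataFrame.ads.read_sql(
        "SELECT * FROM SH.SALES",
        chunksize=100000,
        connection_parameters=connection_parameters,
    )):
    # write each chunk directly to Object Storage via the oci:// protocol (ocifs)
    df.to_csv(
        f"oci://<bucket_name>@<namespace>/sales/part_{i}.csv",
        index=False,
        storage_options={"config": "~/.oci/config"},
    )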

Very large result sets

If the data exceeds what’s practical in a notebook, then the next step is to use the Data Flow service to partition the data across multiple nodes and handle data of any size up to the size of the cluster.

Writing data to the Oracle RDBMS (from a Pandas dataframe)

Typically, you would do this using df.to_sql. However, this uses the ORM to collect data and is less efficient than code that has been optimized for a specific database.

Instead, use the Pandas ADS accessor mixin.

With a dataframe df, writing it to the database is as simple as:

df.ads.to_sql(
    "MY_TABLE",
    connection_parameters=connection_parameters,
    if_exists="replace"
)

The resulting data types (if the table was created by ADS, as opposed to inserting into an existing table) are governed by the following mapping:

Pandas        Oracle
------        ------
bool          NUMBER(1)
int16         INTEGER
int32         INTEGER
int64         INTEGER
float16       FLOAT
float32       FLOAT
float64       FLOAT
datetime64    TIMESTAMP
string        VARCHAR2 (maximum length of the actual data)

When a table is created, the length of any VARCHAR2 column is computed from the longest string in the column. The ORM defaults to the CLOB data type, which is neither correct nor efficient. CLOBs are stored efficiently by the database, but the C API used to query them works differently: non-LOB columns are returned to the client through a cursor, while LOBs incur an additional network fetch per row, per LOB column. ADS deals with this by creating the correct data type and setting the correct VARCHAR2 length.
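As an illustrative sketch of that mapping (the table name MY_PRODUCTS and the data are made up, and connection_parameters is reused from the earlier examples), a string column whose longest value is 12 characters would be created as VARCHAR2(12) rather than a CLOB:

import pandas as pd
import ads

df = pd.DataFrame({
    "PRODUCT": ["widget", "large widget", "nut"],  # longest value: 12 characters
    "PRICE": [1.5, 2.25, 0.4],                     # float64
})

# ADS derives the VARCHAR2 length from the longest string in the column
# (equivalent to df["PRODUCT"].str.len().max()), so the created table gets
# PRODUCT VARCHAR2(12) and PRICE FLOAT, per the mapping above.
df.ads.to_sql(
    "MY_PRODUCTS",
    connection_parameters=connection_parameters,
    if_exists="replace",
)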