Implement Simple Hybrid Search for Retrieval Augmented Generation using Oracle Database 23ai
Introduction
This tutorial demonstrates how to implement a hybrid search as part of a Retrieval Augmented Generation (RAG) process using both vector and keyword search in Oracle 23ai.
RAG has emerged as a key capability for enterprises using Large Language Models (LLMs) to augment their responses with specific business or domain information. Searching an enterprise knowledge base for information relevant to a user's query, then attaching the retrieved information to the request to the LLM, allows for responses that draw upon internal data, policies, and scenario-specific information. The specific information reduces the likelihood of hallucination and allows for a natural language response which includes appropriate citations and references to documents in the knowledge base.
The Oracle Database includes multiple powerful capabilities for performing RAG tasks. The Oracle Database 23ai release introduced an AI Vector Search capability, which provides the ability to perform fast semantic searches on unstructured data. Using semantic search with high quality vector embeddings can seem almost magical, with almost any query summoning highly relevant documents from a huge knowledge base. However, just because vector search is available and provides high quality results in most scenarios does not mean that traditional keyword-based search should be abandoned. Any developer who has spent much time testing retrieval has certainly discovered some oddities: documents which cover the specific subject asked about, and which would intuitively be expected in the results, are sometimes not returned, even though they would be trivially found by a keyword search.
So, why not use both?
Oracle Database 23ai builds upon all the powerful capabilities that have been added to the database over time, including Oracle Text, which provides rich text query capabilities. This tutorial demonstrates how the presence of both of these capabilities in the same database makes it incredibly simple to implement a robust hybrid search, providing the best of both worlds without any data duplication.
Note: This tutorial demonstrates implementing the logic for hybrid search in Python, using a local embedding model. Oracle Database 23ai supports calculating text embeddings in the database through the use of ONNX models, and there is native support for hybrid search using an ONNX model in the database through a Database Management System (DBMS) package. Implementing the logic directly in Python provides much greater control over the behaviour of the search; however, the DBMS package offers a simple yet powerful set of capabilities for some use cases. For more information, see Import ONNX Models into Oracle Database End-to-End Example and Understand Hybrid Search.
Objectives
- Set up a database table for hybrid search.
- Implement a simple document ingestion process in Python.
- Implement vector search and keyword search of the documents.
Prerequisites
- Access to an Oracle Database 23ai database instance, with one or more users who can create tables and indexes.
- A Python runtime with the ability to connect to the database.
Note: This tutorial uses Python for interacting with the Oracle Database, as it is assumed that hybrid search of documents will be implemented as part of a broader RAG process. However, the key capabilities are demonstrated using only SQL, so the approach can be applied in other development languages.
Task 1: Set Up the Database Table
The table used for hybrid search can be identical to a table used for vector search, as Oracle Text can index Character Large Object (CLOB) fields, which are typically used to store the document content.
Note: The SQL for the initial table setup is shown here directly, as opposed to being invoked from Python. The database user account employed by your RAG process should only have permission to query the table, not to create tables and indexes, so these tasks should be performed by a database administrator using their preferred tools.
- After connecting to the database, create a table to use for storing the documents used by the RAG process using the following SQL.

  CREATE TABLE hybridsearch (
      id RAW(16) DEFAULT SYS_GUID() PRIMARY KEY,
      text CLOB,
      embeddings VECTOR(768, FLOAT32),
      metadata JSON
  );
  The vector column size depends upon the embedding model which will be used to generate the vectors for semantic search. Here 768 is used, which corresponds with the vector model used later in this example; if an alternative model is used, this value may need to be updated to reflect that change. A JSON column is specified for storing document metadata, as this provides a flexible structure while still permitting filtering on the attributes of the document. Though it is not used in this tutorial, it is included because any real-world scenario will require document metadata.
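  If you are unsure of the dimensionality of a particular model, the SentenceTransformers library (used later in this tutorial) can report it. A minimal check, assuming the same all-mpnet-base-v2 model used in the ingestion task:

  from sentence_transformers import SentenceTransformer

  # Load the embedding model and print its output dimensionality,
  # which must match the size of the VECTOR column.
  model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
  print(model.get_sentence_embedding_dimension())  # prints 768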
- To enable keyword searching of the text, a text index needs to be created on the text column.

  CREATE SEARCH INDEX rag_text_index ON hybridsearch (text);
Task 2: Install Libraries
This tutorial illustrates using a Python runtime to implement document ingestion and hybrid search. It is recommended that you configure your Python runtime using a venv or conda environment.

Note: This tutorial introduces sections of code as required to demonstrate each concept, and the code would require refactoring if it were to be incorporated into a broader solution.
- Install the dependencies required for this tutorial using pip.

  $ pip install -U oracledb sentence-transformers git+https://github.com/LIAAD/yake
Task 3: Ingest Documents into the Database
Once the table has been created, documents can be inserted as rows into the table. Typically, the ingestion process should be separate from the query process, and should use different database accounts with different permissions (as the query process should not be able to modify the table); however, for the purposes of this tutorial, they are not differentiated here.
- In your Python environment, establish your connection to the database. For example, for Autonomous Transaction Processing.

  import os
  import oracledb
  import traceback
  import json
  import re

  try:
      print(f'Attempting to connect to the database with user: [{os.environ["DB_USER"]}] and dsn: [{os.environ["DB_DSN"]}]')
      connection = oracledb.connect(user=os.environ["DB_USER"],
                                    password=os.environ["DB_PASSWORD"],
                                    dsn=os.environ["DB_DSN"],
                                    config_dir="/path/to/dbwallet",
                                    wallet_location="/path/to/dbwallet",
                                    wallet_password=os.environ["DB_WALLET_PASSWORD"])
      print("Connection successful!")
  except Exception as e:
      print(traceback.format_exc())
      print("Connection failed!")
  The python-oracledb documentation provides details on connecting to non-ADB instances which may not use connection wallets.
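  For reference, a minimal sketch of a wallet-free connection (the host, port, and service name shown are placeholders, not values from this tutorial):

  # Direct connection without a wallet, e.g. to a local or on-premises instance.
  # Replace the dsn with your own host, port, and service name.
  connection = oracledb.connect(user=os.environ["DB_USER"],
                                password=os.environ["DB_PASSWORD"],
                                dsn="dbhost.example.com:1521/freepdb1")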
- Initialize the embedding model which will be used to calculate embedding vectors. Here the all-mpnet-base-v2 model is being used, which is available under the Apache license. Though this specific embedding model is only used for illustration, other models may perform better or worse depending upon your data. This example uses the SentenceTransformers interface for simplicity. For more information, see SentenceTransformers Documentation.

  from sentence_transformers import SentenceTransformer

  model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
- Implement a simple document ingestion function. The process of obtaining, parsing, and chunking documents is out of scope for this tutorial; it is assumed documents will simply be provided as strings. This function calculates the embeddings using the provided model, then inserts the document and embeddings into the created table.

  def add_document_to_table(connection, table_name, model, document, **kwargs):
      """
      Adds a document to the database for use in RAG.
      @param connection An established database connection.
      @param table_name The name of the table to add the document to
      @param model A sentence transformers model, with an 'encode' function that returns embeddings
      @param document The document to add, as a string
      Keyword Arguments:
          metadata: A dict with metadata about the document which is stored as a JSON object
      """
      # Calculate the embeddings for the document
      embeddings = model.encode(document)
      insert_sql = f"""INSERT INTO {table_name} (text, embeddings, metadata)
                       VALUES (:text, :embeddings, :metadata)"""
      metadata = kwargs.get('metadata', {})
      cursor = connection.cursor()
      try:
          cursor.execute(insert_sql,
                         text=document,
                         embeddings=json.dumps(embeddings.tolist()),
                         metadata=json.dumps(metadata))
      except Exception as e:
          print(traceback.format_exc())
          print("Insert failed!")
- Add some sample documents to the database for testing.

  Note: An explicit commit() is invoked, which triggers the update of the text index.

  table_name = "hybridsearch"

  # These samples are just included to provide some data to search, not to
  # demonstrate the efficacy of key phrase versus semantic search. The benefits
  # of hybrid search typically start to emerge when using a much larger volume
  # of content.
  document_samples = [
      "Oracle Database 23ai is the next long-term support release of Oracle Database. It includes over 300 new features with a focus on artificial intelligence (AI) and developer productivity.",
      "Features such as AI Vector Search enable you to leverage a new generation of AI models to generate and store vectors of documents, images, sound, and so on; index them and quickly look for similarity while leveraging the existing analytical capabilities of Oracle Database.",
      "New developer-focused features now make it simpler to build next-generation applications that use JSON or relational development approaches or both interchangeably.",
      "With a built-in VECTOR data type, you can run AI-powered vector similarity searches within the database instead of having to move business data to a separate vector database.",
      "Property graphs provide an intuitive way to find direct or indirect dependencies in data elements and extract insights from these relationships. The enterprise-grade manageability, security features, and performance features of Oracle Database are extended to property graphs.",
      "The ISO SQL standard has been extended to include comprehensive support for property graph queries and creating property graphs in SQL.",
      "Transactional Event Queues (TxEventQ) are queues built into the Oracle Database. TxEventQ are a high performance partitioned implementation with multiple event streams per queue.",
      "Transactional Event Queues (TxEventQ) now support the KafkaProducer and KafkaConsumer classes from Apache Kafka. Oracle Database can now be used as a source or target for applications using the Kafka APIs.",
      "Database metrics are stored in Prometheus, a time-series database and metrics tailored for developers are displayed using Grafana dashboards. A database metrics exporter aids the metrics exports from database views into Prometheus time series database.",
      "The Java Database Connectivity (JDBC) API is the industry standard for database-independent connectivity between the Java programming language and a wide range of databases—SQL databases and other tabular data sources, such as spreadsheets or flat files.",
      "Java Database Connectivity (JDBC) is a Java standard that provides the interface for connecting from Java to relational databases. The JDBC standard is defined and implemented through the standard java.sql interfaces. This enables individual providers to implement and extend the standard with their own JDBC drivers.",
      "The JDBC Thin driver enables a direct connection to the database by providing an implementation of Oracle Net Services on top of Java sockets. The driver supports the TCP/IP protocol and requires a TNS listener on the TCP/IP sockets on the database server.",
      "The JDBC Thin driver is a pure Java, Type IV driver that can be used in applications. It is platform-independent and does not require any additional Oracle software on the client-side. The JDBC Thin driver communicates with the server using Oracle Net Services to access Oracle Database.",
      "The JDBC OCI driver, written in a combination of Java and C, converts JDBC invocations to calls to OCI, using native methods to call C-entry points. These calls communicate with the database using Oracle Net Services.",
      "The python-oracledb driver is a Python extension module that enables access to Oracle Database. By default, python-oracledb allows connecting directly to Oracle Database 12.1 or later. This Thin mode does not need Oracle Client libraries.",
      "Users interact with a Python application, for example by making web requests. The application program makes calls to python-oracledb functions. The connection from python-oracledb Thin mode to the Oracle Database is established directly.",
      "Python-oracledb is said to be in ‘Thick’ mode when it links with Oracle Client libraries. Depending on the version of the Oracle Client libraries, this mode of python-oracledb can connect to Oracle Database 9.2 or later.",
      "To use python-oracledb Thick mode, the Oracle Client libraries must be installed separately. The libraries can be from an installation of Oracle Instant Client, from a full Oracle Client installation (such as installed by Oracle’s GUI installer), or even from an Oracle Database installation (if Python is running on the same machine as the database).",
      "Oracle’s standard client-server version interoperability allows connection to both older and newer databases from different Oracle Client library versions."
  ]

  for document in document_samples:
      add_document_to_table(connection, table_name, model, document)

  # Call an explicit commit after adding the documents, which will trigger an
  # async update of the text index
  connection.commit()
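  As a quick sanity check, you can confirm the rows were inserted before moving on. A minimal sketch:

  # Count the rows now present in the table (should match len(document_samples))
  cursor = connection.cursor()
  cursor.execute(f"SELECT COUNT(*) FROM {table_name}")
  print(cursor.fetchone()[0])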
Task 4: Implement Oracle Database 23ai AI Vector Search
Once the documents have been loaded, the Oracle Database 23ai AI Vector Search capabilities can be used to perform a semantic search based upon a vector we derive from a query.
- Implement a helper function to work with the CLOB objects returned by the database.

  def get_clob(result):
      """
      Utility function for getting the value of a LOB result from the DB.
      @param result Raw value from the database
      @returns string
      """
      clob_value = ""
      if result:
          if isinstance(result, oracledb.LOB):
              raw_data = result.read()
              if isinstance(raw_data, bytes):
                  clob_value = raw_data.decode("utf-8")
              else:
                  clob_value = raw_data
          elif isinstance(result, str):
              clob_value = result
          else:
              raise Exception("Unexpected type:", type(result))
      return clob_value
- Implement a function for performing the semantic search using the vector_distance() SQL function. The all-mpnet-base-v2 model used in this tutorial uses COSINE similarity, which has been defaulted here. If you use a different model, you may need to specify an alternative distance strategy.

  def retrieve_documents_by_vector_similarity(connection, table_name, model, query, num_results, **kwargs):
      """
      Retrieves the most similar documents from the database based upon semantic similarity.
      @param connection An established database connection.
      @param table_name The name of the table to query
      @param model A sentence transformers model, with an 'encode' function that returns embeddings
      @param query The string to search for semantic similarity with
      @param num_results The number of results to return
      Keyword Arguments:
          distance_strategy: The distance strategy to use for comparison
                             One of: 'EUCLIDEAN', 'DOT', 'COSINE' - Default: COSINE
      @returns: Array<(string, string, dict)> Array of documents as a tuple of 'id', 'text', 'metadata'
      """
      # In many cases, building up the search SQL may involve adding a WHERE
      # clause in order to search only a subset of documents, though this is
      # omitted for this simple example.
      search_sql = f"""SELECT id, text, metadata,
                              vector_distance(embeddings, :embedding, {kwargs.get('distance_strategy', 'COSINE')}) as distance
                       FROM {table_name}
                       ORDER BY distance
                       FETCH APPROX FIRST {num_results} ROWS ONLY
                    """
      query_embedding = model.encode(query)
      cursor = connection.cursor()
      try:
          cursor.execute(search_sql, embedding=json.dumps(query_embedding.tolist()))
      except Exception as e:
          print(traceback.format_exc())
          print("Retrieval failed!")
      rows = cursor.fetchall()
      documents = []
      for row in rows:
          documents.append((row[0].hex(), get_clob(row[1]), row[2]))
      return documents
- Validate the semantic search capability using the following sample.

  query = "I am writing a python application and want to use Apache Kafka for interacting with queues, is this supported by the Oracle database?"

  documents_from_vector_search = retrieve_documents_by_vector_similarity(connection, table_name, model, query, 4)
  print(documents_from_vector_search)
Task 5: Implement Keyword Search
This tutorial uses Oracle Text, which provides powerful text query tooling within the Oracle Database. While Oracle Text provides a wide range of capabilities, for the purposes of hybrid search all that is required is a simple search by keywords or key phrases. There are a number of techniques for keyword extraction and searching; this implementation is intended to be as simple as possible, using Yet Another Keyword Extractor (YAKE), which relies upon language features to perform unsupervised extraction of keywords and key phrases.
There are a wide range of other approaches to keyword search, with the Okapi BM25 algorithm being popular. However, using unsupervised keyword extraction with a powerful text search index such as that provided by Oracle Text has the advantage of being particularly simple, and robustness is provided via the combination with semantic search.
- Implement a function for key phrase extraction.

  import yake

  def extract_keywords(query, num_results):
      """
      Utility function for extracting keywords from a string.
      @param query The string from which keywords should be extracted
      @param num_results The number of keywords/phrases to return
      @returns Array<(string, number)> Array of keywords/phrases as a tuple of 'keyword', 'score' (lower scores are more significant)
      """
      language = "en"
      # Max number of words to include in a key phrase
      max_ngram_size = 2
      windowSize = 1
      kw_extractor = yake.KeywordExtractor(lan=language,
                                           n=max_ngram_size,
                                           windowsSize=windowSize,
                                           top=num_results,
                                           features=None)
      keywords = kw_extractor.extract_keywords(query.strip().lower())
      return sorted(keywords, key=lambda kw: kw[1])
As this method of keyword extraction relies upon language features, setting the appropriate language is important, and the performance may vary depending upon the language itself.
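  For example, for content in another language you might configure the extractor with the appropriate language code. A minimal sketch, assuming YAKE supports the language in question (here Portuguese, "pt"):

  # Same constructor as above, with a different language code
  kw_extractor = yake.KeywordExtractor(lan="pt", n=2, top=4)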
- Validate the keyword extraction using the following sample.

  query = "I am writing a python application and want to use Apache Kafka for interacting with queues, is this supported by the Oracle database?"

  keywords = extract_keywords(query, 4)
  print(keywords)
- Implement a function for performing keyword based search.

  def retrieve_documents_by_keywords(connection, table_name, query, num_results):
      """
      Retrieves the documents from the database which have the highest density of matching keywords as the query
      @param connection An established database connection.
      @param table_name The name of the table to query
      @param query The string from which to extract keywords/phrases for searching
      @param num_results The number of results to return
      @returns: Array<(string, string, dict)> Array of documents as a tuple of 'id', 'text', 'metadata'
      """
      num_keywords = 4
      keywords = extract_keywords(query, num_keywords)
      search_sql = f"""SELECT id, text, metadata, SCORE(1)
                       FROM {table_name}
                       WHERE CONTAINS (text, :query_keywords, 1) > 0
                       ORDER BY SCORE(1) DESC
                       FETCH APPROX FIRST {num_results} ROWS ONLY
                    """
      # Assemble the keyword search query, adding the stemming operator to each word
      stemmed_keywords = []
      splitter = re.compile('[^a-zA-Z0-9_\\+\\-/]')
      for keyword in keywords:
          stemmed_keyword = ""
          for single_word in splitter.split(keyword[0]):
              stemmed_keyword += "$" + single_word + " "
          stemmed_keywords.append(stemmed_keyword.strip())
      cursor = connection.cursor()
      try:
          cursor.execute(search_sql, query_keywords=",".join(stemmed_keywords))
      except Exception as e:
          print(traceback.format_exc())
          print("Retrieval failed!")
      rows = cursor.fetchall()
      documents = []
      for row in rows:
          documents.append((row[0].hex(), get_clob(row[1]), row[2]))
      return documents
  One of the simplest behaviours in Oracle Text is performing keyword searches via the CONTAINS function, which supports a wide range of additional operators to refine or broaden the search. In this tutorial, the stemming operator ($) is used, which expands a query to include all terms having the same stem or root word as the specified term. This is used to normalize words regardless of plurality and tense, allowing 'cat' to match 'cats', for instance. For more information, see Oracle Text CONTAINS Query Operators.

  Note: If applying this to a large corpus of documents, it is advised to configure the text index to include word stems to improve performance. For more information on Basic Lexer, see BASIC_LEXER.
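  To make this concrete, the following sketch traces the keyword assembly logic from the function above (the keyword tuples shown are hypothetical; actual YAKE output for a given query will differ):

  # Hypothetical output of extract_keywords(query, 2)
  keywords = [("apache kafka", 0.02), ("oracle database", 0.04)]

  # The loop in retrieve_documents_by_keywords turns these into:
  #   ["$apache $kafka", "$oracle $database"]
  # and the value bound to :query_keywords becomes:
  #   "$apache $kafka,$oracle $database"
  # In Oracle Text, the comma is the ACCUM (accumulate) operator, so documents
  # matching more of the stemmed phrases receive a higher SCORE.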
- Validate the keyword based search using the following sample.

  query = "I am writing a python application and want to use Apache Kafka for interacting with queues, is this supported by the Oracle database?"

  documents_from_keyphrase_search = retrieve_documents_by_keywords(connection, table_name, query, 4)
  print(documents_from_keyphrase_search)
Combine and Use the Results
Once the documents have been obtained, they can be provided to an LLM as additional context which it can use to respond to queries or instructions. In some scenarios it may be appropriate to simply include all retrieved documents in the prompt to the LLM. In other cases, the lack of relevant documents is an important piece of context in and of itself, and so it may be important to determine their relevance. This may be based upon a particular weighting that is placed upon each type of search, or the relevance of each document can be assessed independently of how it was returned using a re-ranking model.
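As one possible approach to assessing relevance independently, a cross-encoder re-ranking model can score each retrieved document against the query. A minimal sketch using the SentenceTransformers CrossEncoder interface; the specific model named here is an illustrative assumption, not part of this tutorial, and documents is assumed to be a combined list of (id, text, metadata) tuples as returned by the retrieval functions:

from sentence_transformers import CrossEncoder

# Score each (query, document text) pair; higher scores indicate greater relevance
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
scores = reranker.predict([(query, document[1]) for document in documents])

# Sort the documents by descending relevance score
reranked = [document for _, document in
            sorted(zip(scores, documents), key=lambda pair: pair[0], reverse=True)]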
In each of these use cases, deduplicating the results will be an important step. Each retrieval function preserves the document id, which provides a unique identifier that can be used for this purpose. For example:
def deduplicate_documents(*args):
    """
    Combines lists of documents returning a union of the lists, with duplicates removed (based upon an 'id' match)
    Arguments:
        Any number of arrays of documents as a tuple of 'id', 'text', 'metadata'
    @returns: Array<(string, string, dict)> Single array of documents containing no duplicates
    """
    # Definitely not the most efficient de-duplication, but in this case, lists are typically <10 items
    documents = []
    for document_list in args:
        for document in document_list:
            if document[0] not in map(lambda doc: doc[0], documents):
                documents.append(document)
    return documents
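For example, the two retrieval functions implemented above can be combined as follows (a simple union, without the weighting or re-ranking discussed earlier):

query = "I am writing a python application and want to use Apache Kafka for interacting with queues, is this supported by the Oracle database?"

# Retrieve documents using both mechanisms, then take the de-duplicated union
vector_results = retrieve_documents_by_vector_similarity(connection, table_name, model, query, 4)
keyword_results = retrieve_documents_by_keywords(connection, table_name, query, 4)
combined_documents = deduplicate_documents(vector_results, keyword_results)
print(f"Retrieved {len(combined_documents)} unique documents")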
These two methods enable extraction of relevant documents from the same database table, using two different mechanisms for determining similarity. Both methods are very fast to execute, so beyond maintaining the text index there is minimal overhead in applying both techniques to retrieve relevant documents. Semantic search allows for retrieval of documents regardless of the use of synonyms or the occasional typo, while key phrase search can capture scenarios in which the user is asking very specifically about a particular subject, such as a product or function name. The combined results complement each other to add robustness to the overall RAG process.
Acknowledgments
- Author - Callan Howell-Pavia (APAC Solutions Specialist)
More Learning Resources
Explore other labs on docs.oracle.com/learn or access more free learning content on the Oracle Learning YouTube channel. Additionally, visit education.oracle.com/learning-explorer to become an Oracle Learning Explorer.
For product documentation, visit Oracle Help Center.