Implement Simple Hybrid Search for Retrieval Augmented Generation using Oracle Database 23ai
Introduction
This tutorial demonstrates how to implement hybrid search in a Retrieval Augmented Generation (RAG) process, using both vector and keyword search in Oracle Database 23ai.
RAG has become a key capability for enterprises using Large Language Models (LLMs), allowing model responses to be augmented with specific business or domain information. By searching an enterprise knowledge base for information relevant to a user's query, then appending the retrieved information to the LLM request, responses can be grounded in internal data, policies, and scenario-specific information. Grounding responses in specific information reduces the likelihood of hallucination and allows natural-language answers that include appropriate references and citations to documents in the knowledge base.
Oracle Database includes a number of powerful capabilities for performing RAG tasks. The Oracle Database 23ai release introduced AI Vector Search, which enables fast semantic search over unstructured data. Semantic search using high-quality vector embeddings can seem almost magical, with nearly any query surfacing highly relevant documents from a vast knowledge base. However, the fact that vector search is available, and in most cases provides high-quality results, does not mean that traditional keyword-based search should be abandoned. Any developer who has spent significant time testing retrieval has surely encountered oddities in which documents that cover the exact topic being asked about, and that would intuitively be included in the response, are not returned, even though they would be trivially found by a keyword search.
So, why not use both?
Oracle Database 23ai builds upon all of the powerful capabilities that have been added to the database over time, including Oracle Text, which provides rich text-querying functionality. This tutorial aims to demonstrate how these two database capabilities make it very simple to implement a robust hybrid search, providing the best of both worlds without any data duplication.
Note: This tutorial demonstrates implementing the hybrid search logic in Python, using a local embedding model. Oracle Database 23ai supports calculating text embeddings within the database using ONNX models, and natively supports hybrid search using in-database ONNX models through DBMS packages. Implementing the logic directly in Python provides much greater control over how the searches behave, but the DBMS packages provide a simple yet powerful set of capabilities for some use cases. For more information, see Import ONNX Models into Oracle Database End-to-End Example and Understand Hybrid Search.
Objectives
- Set up a database table for hybrid search.
- Implement a simple document ingestion process in Python.
- Implement vector search and keyword search of the documents.
Prerequisites
- Access to an Oracle Database 23ai instance, with one or more users that can create tables and indexes.
- A Python runtime capable of connecting to the database.
Note: This tutorial uses Python to interact with Oracle Database, as it is assumed that hybrid document search will be implemented as part of a broader RAG process. However, the key capabilities are demonstrated using only SQL, which should allow the approach to be applied in other development languages.
Task 1: Set Up the Database Table
The table used for hybrid search can be the same as the one used for vector search, since Oracle Text can index Character Large Object (CLOB) fields, which are commonly used to store document content.
Note: The SQL for the initial table setup is shown here directly, rather than invoked from Python. The database user account used by the RAG process should have permission only to query the table, not to create tables and indexes, as those tasks would be performed by a database administrator using their preferred tools.
- Once connected to the database, create a table for storing the documents used by the RAG process with the following SQL.
CREATE TABLE hybridsearch (id RAW(16) DEFAULT SYS_GUID() PRIMARY KEY, text CLOB, embeddings VECTOR(768, FLOAT32), metadata JSON);
The size of the vector column depends upon the embedding model that will be used to generate the vectors for semantic search. Here we use 768, which corresponds to the vector model used later in this example, though if an alternative model is used, this value may need to be updated to reflect that change. A JSON column is specified for storing document metadata, as it provides a flexible structure while still allowing filtering on document attributes. Although it is not used in this tutorial, the column is included here because almost any real-world scenario will require document metadata.
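Although metadata filtering is not used in this tutorial, a filter on the JSON column would typically be expressed with Oracle's JSON_VALUE SQL function. The small sketch below simply assembles such a WHERE-clause fragment as a string; the table name, JSON path, and bind name are illustrative, not part of the tutorial's implementation.

```python
def build_metadata_filter(table_name, json_path, bind_name):
    """Sketch: assemble a WHERE-clause fragment that filters on a JSON
    metadata attribute via Oracle's JSON_VALUE function. Identifiers are
    interpolated into the SQL text for illustration only; the attribute
    value itself remains a bind variable."""
    return f"JSON_VALUE({table_name}.metadata, '$.{json_path}') = :{bind_name}"

# For example, restricting a search to documents from a particular source:
clause = build_metadata_filter("hybridsearch", "source", "doc_source")
print(clause)  # JSON_VALUE(hybridsearch.metadata, '$.source') = :doc_source
```

In a real query this fragment would be appended to the search SQL with AND/WHERE as appropriate, and the value supplied as a bind parameter.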
- To enable keyword search of the text, a text index is needed on the text column.

CREATE SEARCH INDEX rag_text_index ON hybridsearch (text);
Task 2: Install Libraries
This tutorial demonstrates using a Python runtime to implement the document ingestion and hybrid search. It is recommended to configure the Python runtime with a venv or conda environment.
Note: This tutorial introduces sections of code as needed to demonstrate each concept; refactoring would be required to incorporate them into a broader solution.
- Install the dependencies required for this tutorial using pip.

$ pip install -U oracledb sentence-transformers git+https://github.com/LIAAD/yake
Task 3: Ingest Documents into the Database
Once the table has been created, documents can be inserted into it as rows. Typically, the ingestion process should be separated from the query process and use different database accounts with different permissions (as the query process should not be able to modify the table), but for the purposes of this tutorial they are not differentiated here.
- In your Python environment, establish a connection to the database. For example, for Autonomous Transaction Processing:

import os
import oracledb
import traceback
import json
import re

try:
    print(f'Attempting to connect to the database with user: [{os.environ["DB_USER"]}] and dsn: [{os.environ["DB_DSN"]}]')
    connection = oracledb.connect(user=os.environ["DB_USER"], password=os.environ["DB_PASSWORD"], dsn=os.environ["DB_DSN"],
                                  config_dir="/path/to/dbwallet", wallet_location="/path/to/dbwallet",
                                  wallet_password=os.environ["DB_WALLET_PASSWORD"])
    print("Connection successful!")
except Exception as e:
    print(traceback.format_exc())
    print("Connection failed!")
The python-oracledb documentation provides details about connecting to non-ADB instances, which may not use a connection wallet.
- Initialize the embedding model which will be used to calculate the embedding vectors. Here the all-mpnet-base-v2 model is used, which is available under the Apache license. While this particular embedding model is used for illustration only, other models may perform better or worse depending upon your data. This example uses the SentenceTransformers interface for simplicity. For more information, see the SentenceTransformers documentation.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
- Implement a simple document ingestion function. The process of obtaining, parsing, and chunking documents is out of scope for this tutorial; for its purposes, it is assumed that they are simply provided as strings. This function calculates the embeddings using the provided model, then inserts the document and embeddings into the created table.

def add_document_to_table(connection, table_name, model, document, **kwargs):
    """
    Adds a document to the database for use in RAG.
    @param connection An established database connection.
    @param table_name The name of the table to add the document to
    @param model A sentence transformers model, with an 'encode' function that returns embeddings
    @param document The document to add, as a string
    Keyword Arguments:
    metadata: A dict with metadata about the document which is stored as a JSON object
    """
    #Calculate the embeddings for the document
    embeddings = model.encode(document)
    insert_sql = f"""INSERT INTO {table_name} (text, embeddings, metadata) VALUES (:text, :embeddings, :metadata)"""
    metadata = kwargs.get('metadata', {})
    cursor = connection.cursor()
    try:
        cursor.execute(insert_sql, text=document, embeddings=json.dumps(embeddings.tolist()), metadata=json.dumps(metadata))
    except Exception as e:
        print(traceback.format_exc())
        print("Insert failed!")
- Add some sample documents to the database for testing.

Note: An explicit commit() is called, which triggers an update of the text index. (A missing comma between two of the sample strings and an inconsistent table name have been corrected here; the table name must match the table created in Task 1.)

table_name = "hybridsearch"

# These samples are just included to provide some data to search, not to
# demonstrate the efficacy of key phrase versus semantic search. The benefits
# of hybrid search typically start to emerge when using a much larger volume
# of content.
document_samples = [
    "Oracle Database 23ai is the next long-term support release of Oracle Database. It includes over 300 new features with a focus on artificial intelligence (AI) and developer productivity.",
    "Features such as AI Vector Search enable you to leverage a new generation of AI models to generate and store vectors of documents, images, sound, and so on; index them and quickly look for similarity while leveraging the existing analytical capabilities of Oracle Database.",
    "New developer-focused features now make it simpler to build next-generation applications that use JSON or relational development approaches or both interchangeably.",
    "With a built-in VECTOR data type, you can run AI-powered vector similarity searches within the database instead of having to move business data to a separate vector database.",
    "Property graphs provide an intuitive way to find direct or indirect dependencies in data elements and extract insights from these relationships. The enterprise-grade manageability, security features, and performance features of Oracle Database are extended to property graphs.",
    "The ISO SQL standard has been extended to include comprehensive support for property graph queries and creating property graphs in SQL.",
    "Transactional Event Queues (TxEventQ) are queues built into the Oracle Database. TxEventQ are a high performance partitioned implementation with multiple event streams per queue.",
    "Transactional Event Queues (TxEventQ) now support the KafkaProducer and KafkaConsumer classes from Apache Kafka. Oracle Database can now be used as a source or target for applications using the Kafka APIs.",
    "Database metrics are stored in Prometheus, a time-series database and metrics tailored for developers are displayed using Grafana dashboards. A database metrics exporter aids the metrics exports from database views into Prometheus time series database.",
    "The Java Database Connectivity (JDBC) API is the industry standard for database-independent connectivity between the Java programming language and a wide range of databases—SQL databases and other tabular data sources, such as spreadsheets or flat files.",
    "Java Database Connectivity (JDBC) is a Java standard that provides the interface for connecting from Java to relational databases. The JDBC standard is defined and implemented through the standard java.sql interfaces. This enables individual providers to implement and extend the standard with their own JDBC drivers.",
    "The JDBC Thin driver enables a direct connection to the database by providing an implementation of Oracle Net Services on top of Java sockets. The driver supports the TCP/IP protocol and requires a TNS listener on the TCP/IP sockets on the database server.",
    "The JDBC Thin driver is a pure Java, Type IV driver that can be used in applications. It is platform-independent and does not require any additional Oracle software on the client-side. The JDBC Thin driver communicates with the server using Oracle Net Services to access Oracle Database.",
    "The JDBC OCI driver, written in a combination of Java and C, converts JDBC invocations to calls to OCI, using native methods to call C-entry points. These calls communicate with the database using Oracle Net Services.",
    "The python-oracledb driver is a Python extension module that enables access to Oracle Database. By default, python-oracledb allows connecting directly to Oracle Database 12.1 or later. This Thin mode does not need Oracle Client libraries.",
    "Users interact with a Python application, for example by making web requests. The application program makes calls to python-oracledb functions. The connection from python-oracledb Thin mode to the Oracle Database is established directly.",
    "Python-oracledb is said to be in ‘Thick’ mode when it links with Oracle Client libraries. Depending on the version of the Oracle Client libraries, this mode of python-oracledb can connect to Oracle Database 9.2 or later.",
    "To use python-oracledb Thick mode, the Oracle Client libraries must be installed separately. The libraries can be from an installation of Oracle Instant Client, from a full Oracle Client installation (such as installed by Oracle’s GUI installer), or even from an Oracle Database installation (if Python is running on the same machine as the database).",
    "Oracle’s standard client-server version interoperability allows connection to both older and newer databases from different Oracle Client library versions."
]

for document in document_samples:
    add_document_to_table(connection, table_name, model, document)

#Call an explicit commit after adding the documents, which will trigger an async update of the text index
connection.commit()
Task 4: Implement Oracle Database 23ai AI Vector Search
With the documents loaded, the Oracle Database 23ai AI Vector Search capabilities can be used to perform a semantic search based upon a vector derived from the query.
- Implement a helper function to handle the CLOB objects returned by the database.

def get_clob(result):
    """
    Utility function for getting the value of a LOB result from the DB.
    @param result Raw value from the database
    @returns string
    """
    clob_value = ""
    if result:
        if isinstance(result, oracledb.LOB):
            raw_data = result.read()
            if isinstance(raw_data, bytes):
                clob_value = raw_data.decode("utf-8")
            else:
                clob_value = raw_data
        elif isinstance(result, str):
            clob_value = result
        else:
            raise Exception("Unexpected type:", type(result))
    return clob_value
- Implement a function which performs a semantic search using the vector_distance() SQL function. The all-mpnet-base-v2 model used in this tutorial uses COSINE similarity, which has been defaulted here. If you use a different model, you may need to specify an alternative distance strategy.

def retrieve_documents_by_vector_similarity(connection, table_name, model, query, num_results, **kwargs):
    """
    Retrieves the most similar documents from the database based upon semantic similarity.
    @param connection An established database connection.
    @param table_name The name of the table to query
    @param model A sentence transformers model, with an 'encode' function that returns embeddings
    @param query The string to search for semantic similarity with
    @param num_results The number of results to return
    Keyword Arguments:
    distance_strategy: The distance strategy to use for comparison
    One of: 'EUCLIDEAN', 'DOT', 'COSINE' - Default: COSINE
    @returns: Array<(string, string, dict)> Array of documents as a tuple of 'id', 'text', 'metadata'
    """
    # In many cases, building up the search SQL may involve adding a WHERE
    # clause in order to search only a subset of documents, though this is
    # omitted for this simple example.
    search_sql = f"""SELECT id, text, metadata, vector_distance(embeddings, :embedding, {kwargs.get('distance_strategy', 'COSINE')}) as distance
                     FROM {table_name}
                     ORDER BY distance
                     FETCH APPROX FIRST {num_results} ROWS ONLY
                  """
    query_embedding = model.encode(query)
    cursor = connection.cursor()
    try:
        cursor.execute(search_sql, embedding=json.dumps(query_embedding.tolist()))
    except Exception as e:
        print(traceback.format_exc())
        print("Retrieval failed!")
    rows = cursor.fetchall()
    documents = []
    for row in rows:
        documents.append((row[0].hex(), get_clob(row[1]), row[2]))
    return documents
- Validate the semantic search capability using the following example.

query = "I am writing a python application and want to use Apache Kafka for interacting with queues, is this supported by the Oracle database?"

documents_from_vector_search = retrieve_documents_by_vector_similarity(connection, table_name, model, query, 4)
print(documents_from_vector_search)
Task 5: Implement Keyword Search
This tutorial uses Oracle Text, which provides powerful text-query tooling within Oracle Database. While Oracle Text provides a wide range of capabilities, for hybrid search all that is required is a simple search by keywords or key phrases. There are a number of techniques for keyword extraction and search, but this implementation is intended to be as simple as possible, using Yet Another Keyword Extractor (YAKE), which relies upon linguistic features to perform unsupervised extraction of keywords and key phrases.

There are various other approaches to keyword search, with the Okapi BM25 algorithm being popular. However, combining unsupervised keyword extraction with a powerful text search index, such as the one provided by Oracle Text, has the benefit of being particularly simple, and gains robustness through its combination with semantic search.
- Implement a function for key phrase extraction.

import yake

def extract_keywords(query, num_results):
    """
    Utility function for extracting keywords from a string.
    @param query The string from which keywords should be extracted
    @param num_results The number of keywords/phrases to return
    @returns Array<(string, number)> Array of keywords/phrases as a tuple of 'keyword', 'score' (lower scores are more significant)
    """
    language = "en"
    #Max number of words to include in a key phrase
    max_ngram_size = 2
    windowSize = 1
    kw_extractor = yake.KeywordExtractor(lan=language, n=max_ngram_size, windowsSize=windowSize, top=num_results, features=None)
    keywords = kw_extractor.extract_keywords(query.strip().lower())
    return sorted(keywords, key=lambda kw: kw[1])
As this approach to keyword extraction relies upon linguistic features, setting the appropriate language is important, and performance may vary by the language itself.
- Validate the keyword extraction using the following example.

query = "I am writing a python application and want to use Apache Kafka for interacting with queues, is this supported by the Oracle database?"

keywords = extract_keywords(query, 4)
print(keywords)
- Implement a function for performing keyword-based search.

def retrieve_documents_by_keywords(connection, table_name, query, num_results):
    """
    Retrieves the documents from the database which have the highest density of matching keywords as the query
    @param connection An established database connection.
    @param table_name The name of the table to query
    @param query The string from which to extract keywords/phrases for searching
    @param num_results The number of results to return
    @returns: Array<(string, string, dict)> Array of documents as a tuple of 'id', 'text', 'metadata'
    """
    num_keywords = 4
    keywords = extract_keywords(query, num_keywords)
    search_sql = f"""SELECT id, text, metadata, SCORE(1)
                     FROM {table_name}
                     WHERE CONTAINS (text, :query_keywords, 1) > 0
                     ORDER BY SCORE(1) DESC
                     FETCH APPROX FIRST {num_results} ROWS ONLY
                  """
    #Assemble the keyword search query, adding the stemming operator to each word
    stemmed_keywords = []
    splitter = re.compile('[^a-zA-Z0-9_\\+\\-/]')
    for keyword in keywords:
        stemmed_keyword = ""
        for single_word in splitter.split(keyword[0]):
            stemmed_keyword += "$" + single_word + " "
        stemmed_keywords.append(stemmed_keyword.strip())
    cursor = connection.cursor()
    try:
        cursor.execute(search_sql, query_keywords=",".join(stemmed_keywords))
    except Exception as e:
        print(traceback.format_exc())
        print("Retrieval failed!")
    rows = cursor.fetchall()
    documents = []
    for row in rows:
        documents.append((row[0].hex(), get_clob(row[1]), row[2]))
    return documents
One of the simplest behaviours to use in Oracle Text is a keyword search through the CONTAINS function, which supports a wide range of additional operators to refine or broaden the search. In this tutorial, the stemming operator is used. It expands the query to include all terms having the same stem or root word as the specified term. This is used to normalize words regardless of plurality and tense, allowing 'cat' to match with 'cats', for example. For more information, see Oracle Text CONTAINS Query Operators.

Note: If applying this to a large corpus of documents, it is recommended to configure the text index to include word stems to improve performance. For more information about the basic lexer, see BASIC_LEXER.
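To illustrate what the function above actually passes to CONTAINS, the snippet below re-states just the query-assembly step, with a hard-coded keyword/score list standing in for YAKE output, and shows how each word is prefixed with the $ stemming operator.

```python
import re

# Hypothetical keyword/score pairs, standing in for the output of extract_keywords()
keywords = [("apache kafka", 0.04), ("python application", 0.05)]

# Same assembly logic as retrieve_documents_by_keywords(): split each phrase
# into words and prefix each word with Oracle Text's '$' stemming operator
stemmed_keywords = []
splitter = re.compile('[^a-zA-Z0-9_\\+\\-/]')
for keyword in keywords:
    stemmed_keyword = ""
    for single_word in splitter.split(keyword[0]):
        stemmed_keyword += "$" + single_word + " "
    stemmed_keywords.append(stemmed_keyword.strip())

# The comma acts as Oracle Text's accumulate (OR-like) operator between phrases
query_keywords = ",".join(stemmed_keywords)
print(query_keywords)  # $apache $kafka,$python $application
```

The resulting string is then bound to :query_keywords in the CONTAINS clause.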
- Validate the keyword-based search using the following example.

query = "I am writing a python application and want to use Apache Kafka for interacting with queues, is this supported by the Oracle database?"

documents_from_keyphrase_search = retrieve_documents_by_keywords(connection, table_name, query, 4)
print(documents_from_keyphrase_search)
Combining and Using the Results
Once the documents have been obtained, they can be provided to an LLM as additional context for use in responding to a query or instruction. In some scenarios it may be appropriate to simply include all of the retrieved documents in the prompt for the LLM. In others, the lack of relevant documents is itself important context, so determining their relevance may matter. This can be based upon a specific weighting given to each search type, or the relevance of each returned document could be assessed independently using a reranking model.
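One possible way to weight and merge the two result lists is Reciprocal Rank Fusion (RRF), a simple rank-based combination technique. This sketch is not part of the tutorial's implementation; the document tuples and the k constant below are illustrative.

```python
def reciprocal_rank_fusion(result_lists, k=60):
    """Sketch: merge ranked document lists using Reciprocal Rank Fusion.
    Each list contains (id, text, metadata) tuples ordered by relevance;
    a document's fused score is the sum of 1/(k + rank) across the lists."""
    scores = {}
    docs_by_id = {}
    for result_list in result_lists:
        for rank, document in enumerate(result_list, start=1):
            doc_id = document[0]
            docs_by_id[doc_id] = document
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    ranked_ids = sorted(scores, key=scores.get, reverse=True)
    return [docs_by_id[doc_id] for doc_id in ranked_ids]

# Illustrative results: document 'b' appears in both lists, so it ranks first
vector_results = [("a", "doc a", {}), ("b", "doc b", {})]
keyword_results = [("b", "doc b", {}), ("c", "doc c", {})]
fused = reciprocal_rank_fusion([vector_results, keyword_results])
print([doc[0] for doc in fused])  # ['b', 'a', 'c']
```

Because RRF works only on ranks, it avoids having to calibrate the very different score scales of vector distance and Oracle Text's SCORE.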
In every use case, removing duplicated results will be an important step. Each function has preserved the document id, which provides a unique identifier that can be used for this purpose. For example:
def deduplicate_documents(*args):
    """
    Combines lists of documents returning a union of the lists, with duplicates removed (based upon an 'id' match)
    Arguments:
    Any number of arrays of documents as a tuple of 'id', 'text', 'metadata'
    @returns: Array<(string, string, dict)> Single array of documents containing no duplicates
    """
    #Definitely not the most efficient de-duplication, but in this case, lists are typically <10 items
    documents = []
    for document_list in args:
        for document in document_list:
            if document[0] not in map(lambda doc: doc[0], documents):
                documents.append(document)
    return documents
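As a quick standalone illustration of the same id-based de-duplication (using a dict keyed by id rather than the function above, so the snippet runs on its own; the document tuples are illustrative):

```python
# Illustrative overlapping result lists, as (id, text, metadata) tuples
vector_results = [("a1", "doc a", {}), ("b2", "doc b", {})]
keyword_results = [("b2", "doc b", {}), ("c3", "doc c", {})]

# Keep the first occurrence of each id; dicts preserve insertion order
unique = {}
for document in vector_results + keyword_results:
    unique.setdefault(document[0], document)
deduplicated = list(unique.values())
print([doc[0] for doc in deduplicated])  # ['a1', 'b2', 'c3']
```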
These two approaches allow relevant documents to be retrieved from the same database table, using two different mechanisms for determining similarity. Both are very fast to execute, so beyond the text index there is minimal overhead in applying both techniques to retrieve relevant documents. Semantic search allows documents to be retrieved regardless of the use of synonyms or the occasional typo, while key phrase search can capture scenarios in which the user is asking about a particular topic very specifically, such as a product or function name. The combined results can complement each other, adding robustness to the overall RAG process.
Related Links
Acknowledgements
- Author - Callan Howell-Pavia (APAC Solutions Specialist)
More Learning Resources
Explore other labs on docs.oracle.com/learn or access more free learning content on the Oracle Learning YouTube channel. Additionally, visit education.oracle.com/learning-explorer to become an Oracle Learning Explorer.
For product documentation, visit Oracle Help Center.
Implement Simple Hybrid Search for Retrieval Augmented Generation using Oracle Database 23ai
G19807-01
November 2024