Implement Simple Hybrid Search in Retrieval Augmented Generation using Oracle Database 23ai

Introduction

This tutorial demonstrates how to implement hybrid search, using both vector and keyword search in Oracle Database 23ai, in a Retrieval Augmented Generation (RAG) pipeline.

RAG has become an important capability for enterprises using Large Language Models (LLMs) to augment responses with business- or domain-specific information. Searching an enterprise knowledge base for information relevant to a user's query, then appending the retrieved information to the prompt for the LLM, allows responses to be grounded in internal data, policies, and context-specific information. Grounding responses in specific information reduces the likelihood of hallucination, enabling natural-language responses that include appropriate citations and references to documents in the knowledge base.

Oracle Database includes a number of powerful capabilities for performing RAG. The Oracle Database 23ai release introduced AI Vector Search, which enables fast semantic search over unstructured data. Semantic search with high-quality vector embeddings can seem almost magical, with nearly any query surfacing highly relevant documents from a vast knowledge base. However, just because vector search is available and delivers high-quality results in most scenarios does not mean that traditional keyword-based search should be abandoned. Any developer who has spent much time testing retrieval will surely have noticed some oddities: documents covering a specific subject, which one would intuitively expect to be included in a response, are sometimes not retrieved, even though a simple keyword search would have found them.

So, why not use both?

Oracle Database 23ai builds upon all of the powerful capabilities which have been added to the database over time, including Oracle Text, which provides rich text-querying functionality. This tutorial aims to show how having both capabilities in the same database makes implementing robust hybrid search much simpler, providing the best of both worlds without duplicating any data.

Note: This tutorial demonstrates implementing the hybrid search logic in Python, using a local embedding model. Oracle Database 23ai supports computing text embeddings in the database using ONNX models, and natively supports hybrid search using in-database ONNX models through Database Management System (DBMS) packages. Implementing the logic directly in Python provides finer control over the behaviour of the searches, but the DBMS packages offer a simple yet powerful set of capabilities for some use cases. For more information, see Import ONNX Models into Oracle Database End-to-End Example and Understand Hybrid Search.

Objectives

Prerequisites

Task 1: Set Up the Database Table

The table used for hybrid search can be the same as one used for vector search, as Oracle Text can index Character Large Object (CLOB) columns, which are commonly used to store document content.

Note: The SQL for the initial table setup is shown here directly, rather than invoked from Python. The database user account used by your RAG process should only have privileges to query the table, not to create tables and indexes, as those operations would be performed by a database administrator using their preferred tools.

  1. Once connected to the database, create a table for storing the documents used by your RAG process using the following SQL.

    CREATE TABLE hybridsearch 
    (id RAW(16) DEFAULT SYS_GUID() PRIMARY KEY,
    text CLOB, 
    embeddings VECTOR(768, FLOAT32),
    metadata JSON);
    

    The size of the vector column depends upon the embedding model which will be used to generate the vectors for semantic search. Here, 768 is used, corresponding to the vector model used later in this example; if an alternative model is adopted, this value may need to be updated to reflect that change. A JSON column is specified for storing document metadata, as it provides a flexible structure while still allowing filtering on document attributes. The column is not used in this tutorial, but any realistic scenario will require document metadata.

  2. To enable keyword search of the text, a text index must be created on the text column.

    CREATE SEARCH INDEX rag_text_index ON hybridsearch (text);
    

Task 2: Install Libraries

This tutorial shows the implementation of document ingestion and hybrid search using a Python runtime. It is recommended to configure your Python runtime using a venv or conda environment.

Note: This tutorial introduces sections of code as needed to demonstrate each concept; the code will need to be restructured if incorporated into a broader solution.

  1. Install the dependencies required for this tutorial using pip.

    $ pip install -U oracledb sentence-transformers git+https://github.com/LIAAD/yake
    

Task 3: Ingest Documents into the Database

Once the table has been created, documents can be inserted into it as rows. In general, the ingestion process should be kept separate from the query process, using a different database account with different privileges (since the query process should not modify the table), but for the purposes of this tutorial they are not distinguished here.

  1. In your Python environment, create a connection to the database. For example, for Autonomous Transaction Processing.

    import os
    import oracledb
    import traceback
    import json 
    import re
       
    try:
        print(f'Attempting to connect to the database with user: [{os.environ["DB_USER"]}] and dsn: [{os.environ["DB_DSN"]}]')
        connection = oracledb.connect(user=os.environ["DB_USER"], password=os.environ["DB_PASSWORD"], dsn=os.environ["DB_DSN"],
                                    config_dir="/path/to/dbwallet",
                                    wallet_location="/path/to/dbwallet",
                                    wallet_password=os.environ["DB_WALLET_PASSWORD"])
        print("Connection successful!")
    except Exception as e:
        print(traceback.format_exc())
        print("Connection failed!")
    

    The python-oracledb documentation provides details about connecting to non-ADB instances which may not use a connection wallet.

  2. Initialize the embedding model used to compute the embedding vectors. Here, the all-mpnet-base-v2 model is used (available under the Apache license). While this particular embedding model is used purely for illustration, other models may perform better or worse depending upon your data. This example uses the SentenceTransformers interface for simplicity. For more information, see the SentenceTransformers documentation.

    from sentence_transformers import SentenceTransformer
       
    model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
    
  3. Implement a simple document ingestion function. The process of obtaining, parsing, and chunking documents is out of scope for this tutorial, in which it is simply assumed that documents are provided as strings. This function calculates the embeddings using the provided model, then inserts the document into the table that was created.

    def add_document_to_table(connection, table_name, model, document, **kwargs):
        """
        Adds a document to the database for use in RAG.
        @param connection An established database connection.
        @param table_name The name of the table to add the document to
        @param model A sentence transformers model, with an 'encode' function that returns embeddings
        @param document The document to add, as a string
        Keyword Arguments:
        metadata: A dict with metadata about the document which is stored as a JSON object
        """
        #Calculate the embeddings for the document
        embeddings = model.encode(document)
        insert_sql = f"""INSERT INTO {table_name} (text, embeddings, metadata) VALUES (:text, :embeddings, :metadata)"""
        metadata = kwargs.get('metadata', {})
        cursor = connection.cursor()
        try:
            cursor.execute(insert_sql, text=document, embeddings=json.dumps(embeddings.tolist()), metadata=json.dumps(metadata))
        except Exception as e:
            print(traceback.format_exc())
            print("Insert failed!")
    
  4. Add some sample documents to the database for testing.

    Note: An explicit commit() is called, which triggers the update of the text index.

    table_name = "hybridsearch"
       
    # These samples are just included to provide some data to search, not to 
    # demonstrate the efficacy of key phrase versus semantic search. The benefits
    # of hybrid search typically start to emerge when using a much larger volume
    # of content.
    document_samples = [
        "Oracle Database 23ai is the next long-term support release of Oracle Database. It includes over 300 new features with a focus on artificial intelligence (AI) and developer productivity.",
        "Features such as AI Vector Search enable you to leverage a new generation of AI models to generate and store vectors of documents, images, sound, and so on;  index them and quickly look for similarity while leveraging the existing analytical capabilities of Oracle Database.",
        "New developer-focused features now make it simpler to build next-generation applications that use JSON or relational development approaches or both interchangeably.",
        "With a built-in VECTOR data type, you can run AI-powered vector similarity searches within the database instead of having to move business data to a separate vector database.",
        "Property graphs provide an intuitive way to find direct or indirect dependencies in data elements and extract insights from these relationships. The enterprise-grade manageability, security features, and performance features of Oracle Database are extended to property graphs.",
        "The ISO SQL standard has been extended to include comprehensive support for property graph queries and creating property graphs in SQL.",
        "Transactional Event Queues (TxEventQ) are queues built into the Oracle Database. TxEventQ are a high performance partitioned implementation with multiple event streams per queue.",
        "Transactional Event Queues (TxEventQ) now support the KafkaProducer and KafkaConsumer classes from Apache Kafka. Oracle Database can now be used as a source or target for applications using the Kafka APIs.",
        "Database metrics are stored in Prometheus, a time-series database and metrics tailored for developers are displayed using Grafana dashboards. A database metrics exporter aids the metrics exports from database views into Prometheus time series database.",
        "The Java Database Connectivity (JDBC) API is the industry standard for database-independent connectivity between the Java programming language and a wide range of databases—SQL databases and other tabular data sources, such as spreadsheets or flat files.",
        "Java Database Connectivity (JDBC) is a Java standard that provides the interface for connecting from Java to relational databases. The JDBC standard is defined and implemented through the standard java.sql interfaces. This enables individual providers to implement and extend the standard with their own JDBC drivers.",
        "The JDBC Thin driver enables a direct connection to the database by providing an implementation of Oracle Net Services on top of Java sockets. The driver supports the TCP/IP protocol and requires a TNS listener on the TCP/IP sockets on the database server.",
        "The JDBC Thin driver is a pure Java, Type IV driver that can be used in applications. It is platform-independent and does not require any additional Oracle software on the client-side. The JDBC Thin driver communicates with the server using Oracle Net Services to access Oracle Database.",
        "The JDBC OCI driver, written in a combination of Java and C, converts JDBC invocations to calls to OCI, using native methods to call C-entry points. These calls communicate with the database using Oracle Net Services.",
        "The python-oracledb driver is a Python extension module that enables access to Oracle Database. By default, python-oracledb allows connecting directly to Oracle Database 12.1 or later. This Thin mode does not need Oracle Client libraries.",
        "Users interact with a Python application, for example by making web requests. The application program makes calls to python-oracledb functions. The connection from python-oracledb Thin mode to the Oracle Database is established directly.",
        "Python-oracledb is said to be in ‘Thick’ mode when it links with Oracle Client libraries. Depending on the version of the Oracle Client libraries, this mode of python-oracledb can connect to Oracle Database 9.2 or later.",
        "To use python-oracledb Thick mode, the Oracle Client libraries must be installed separately. The libraries can be from an installation of Oracle Instant Client, from a full Oracle Client installation (such as installed by Oracle’s GUI installer), or even from an Oracle Database installation (if Python is running on the same machine as the database).",
        "Oracle’s standard client-server version interoperability allows connection to both older and newer databases from different Oracle Client library versions."
    ]
       
    for document in document_samples:
        add_document_to_table(connection, table_name, model, document)
       
    #Call an explicit commit after adding the documents, which will trigger an async update of the text index
    connection.commit()
    

Implement Semantic Search

Once the documents have been loaded, the AI Vector Search capabilities of Oracle Database 23ai can be used to perform a semantic search based upon a vector derived from our query.

  1. Implement a helper function for working with the CLOB objects returned by the database.

    def get_clob(result):
        """
        Utility function for getting the value of a LOB result from the DB.
        @param result Raw value from the database
        @returns string
        """
        clob_value = ""
        if result:
            if isinstance(result, oracledb.LOB):
                raw_data = result.read()
                if isinstance(raw_data, bytes):
                    clob_value = raw_data.decode("utf-8")
                else:
                    clob_value = raw_data
            elif isinstance(result, str):
                clob_value = result
            else:
                raise Exception("Unexpected type:", type(result))
        return clob_value
    
  2. Implement a function which performs a semantic search using the vector_distance() SQL function. The all-mpnet-base-v2 model used in this tutorial was trained using COSINE similarity, which is defaulted here. If you use another model, you may need to specify an alternative distance strategy.

    def retrieve_documents_by_vector_similarity(connection, table_name, model, query, num_results, **kwargs):
        """
        Retrieves the most similar documents from the database based upon semantic similarity.
        @param connection An established database connection.
        @param table_name The name of the table to query
        @param model A sentence transformers model, with an 'encode' function that returns embeddings
        @param query The string to search for semantic similarity with
        @param num_results The number of results to return
        Keyword Arguments:
        distance_strategy: The distance strategy to use for comparison One of: 'EUCLIDEAN', 'DOT', 'COSINE' - Default: COSINE
        @returns: Array<(string, string, dict)> Array of documents as a tuple of 'id', 'text', 'metadata'
        """
        # In many cases, building up the search SQL may involve adding a WHERE 
        # clause in order to search only a subset of documents, though this is
        # omitted for this simple example.
        search_sql = f"""SELECT id, text, metadata,
                        vector_distance(embeddings, :embedding, {kwargs.get('distance_strategy', 'COSINE')}) as distance
                        FROM {table_name}
                        ORDER BY distance
                        FETCH APPROX FIRST {num_results} ROWS ONLY
                        """
        query_embedding = model.encode(query)
        cursor = connection.cursor()
        try:
            cursor.execute(search_sql, embedding=json.dumps(query_embedding.tolist()))
        except Exception as e:
            print(traceback.format_exc())
            print("Retrieval failed!")
        rows = cursor.fetchall()
        documents = []
        for row in rows:
            documents.append((row[0].hex(), get_clob(row[1]), row[2]))
       
        return documents
    
  3. Validate the semantic search using the following example.

    query = "I am writing a python application and want to use Apache Kafka for interacting with queues, is this supported by the Oracle database?"
       
    documents_from_vector_search = retrieve_documents_by_vector_similarity(connection, table_name, model, query, 4)
    print(documents_from_vector_search)
    
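As a sanity check on what the COSINE strategy computes: vector_distance with COSINE returns the cosine distance, which is 1 minus the cosine similarity of the two vectors. The following is a minimal pure-Python sketch of the same computation, independent of the database, for illustration only:

```python
import math

def cosine_distance(v1, v2):
    """Cosine distance as computed by vector_distance(..., COSINE):
    1 minus the cosine similarity of the two vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return 1.0 - dot / (norm1 * norm2)

# Vectors pointing in the same direction have distance 0;
# orthogonal vectors have distance 1.
print(cosine_distance([1.0, 0.0], [2.0, 0.0]))  # → 0.0
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # → 1.0
```

Smaller distances indicate greater similarity, which is why the search SQL orders by distance ascending.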

Implement Keyword Search

This tutorial uses Oracle Text, which provides powerful text-query capabilities within Oracle Database. While Oracle Text offers a broad range of functionality, all that is needed for hybrid search is a simple search by keywords or key phrases. There are a number of techniques for keyword extraction and search; to keep this implementation as simple as possible, it uses Yet Another Keyword Extractor (YAKE), which relies upon language features to perform unsupervised extraction of keywords and key phrases.

There are a number of other approaches to keyword search, the Okapi BM25 algorithm being a common one. However, using unsupervised keyword extraction with a powerful text-search index, such as the one provided by Oracle Text, has the advantage of being particularly simple, with robustness provided through the combination with semantic search.

  1. Implement a function for extracting keywords and key phrases.

    import yake
       
    def extract_keywords(query, num_results):
        """
        Utility function for extracting keywords from a string.
        @param query The string from which keywords should be extracted
        @param num_results The number of keywords/phrases to return
        @returns Array<(string, number)> Array of keywords/phrases as a tuple of 'keyword', 'score' (lower scores are more significant)
        """
        language = "en"
        #Max number of words to include in a key phrase
        max_ngram_size = 2
        windowSize = 1
       
        kw_extractor = yake.KeywordExtractor(lan=language, n=max_ngram_size, windowsSize=windowSize, top=num_results, features=None)
        keywords = kw_extractor.extract_keywords(query.strip().lower())
        return sorted(keywords, key=lambda kw: kw[1])
    

    As this keyword extraction approach relies upon language features, setting the appropriate language is important, and performance may vary across languages.

  2. Validate the keyword extraction using the following example.

    query = "I am writing a python application and want to use Apache Kafka for interacting with queues, is this supported by the Oracle database?"
       
    keywords = extract_keywords(query, 4)
    print(keywords)
    
  3. Implement a function which performs the keyword-based search.

    def retrieve_documents_by_keywords(connection, table_name, query, num_results):
        """
        Retrieves the documents from the database which have the highest density of matching keywords as the query
        @param connection An established database connection.
        @param table_name The name of the table to query
        @param query The string from which to extract keywords/phrases for searching
        @param num_results The number of results to return
        @returns: Array<(string, string, dict)> Array of documents as a tuple of 'id', 'text', 'metadata'
        """
        num_keywords = 4
        keywords = extract_keywords(query, num_keywords)
        search_sql = f"""SELECT id, text, metadata, SCORE(1)
                        FROM {table_name}
                        WHERE CONTAINS (text, :query_keywords, 1) > 0
                        ORDER BY SCORE(1) DESC
                        FETCH APPROX FIRST {num_results} ROWS ONLY
                        """
        #Assemble the keyword search query, adding the stemming operator to each word
        stemmed_keywords = []
        splitter = re.compile('[^a-zA-Z0-9_\\+\\-/]')
        for keyword in keywords:
            stemmed_keyword = ""
            for single_word in splitter.split(keyword[0]):
                stemmed_keyword += "$" + single_word +" "
            stemmed_keywords.append(stemmed_keyword.strip())
        cursor = connection.cursor()
        try:
            cursor.execute(search_sql, query_keywords=",".join(stemmed_keywords))
        except Exception as e:
            print(traceback.format_exc())
            print("Retrieval failed!")
        rows = cursor.fetchall()
        documents = []
        for row in rows:
            documents.append((row[0].hex(), get_clob(row[1]), row[2]))
        return documents
    

    One of the simplest capabilities of Oracle Text is the keyword search performed via the CONTAINS function, which supports a wide range of additional operators to refine or broaden the search. In this tutorial, the stemming operator is used, which expands the query to include all terms having the same stem or root word as the specified term. This normalizes words regardless of pluralization and tense, allowing 'cat' to match instances of 'cats'. For more information, see Oracle Text CONTAINS Query Operators.

    Note: If applying this to a larger corpus of documents, it is recommended to configure the text index to include stems for improved performance. For more information about the basic lexer, see BASIC_LEXER.

  4. Validate the keyword-based search using the following example.

    query = "I am writing a python application and want to use Apache Kafka for interacting with queues, is this supported by the Oracle database?"
       
    documents_from_keyphrase_search = retrieve_documents_by_keywords(connection, table_name, query, 4)
    print(documents_from_keyphrase_search)
    
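The assembly of the stemmed CONTAINS query string inside the search function above can be isolated as a small helper for inspection. This sketch, with a hypothetical helper name, reproduces the same splitting and $ stemming-operator logic, applied to (keyword, score) tuples such as those returned by extract_keywords:

```python
import re

def build_stemmed_contains_query(keywords):
    """Builds a CONTAINS query string, applying the Oracle Text stemming
    operator ($) to each word of each extracted keyword/phrase.
    @param keywords Array of (keyword, score) tuples"""
    splitter = re.compile('[^a-zA-Z0-9_\\+\\-/]')
    stemmed_keywords = []
    for keyword, _score in keywords:
        stemmed = " ".join("$" + word for word in splitter.split(keyword) if word)
        stemmed_keywords.append(stemmed)
    return ",".join(stemmed_keywords)

# Hypothetical keyword tuples, shaped like the output of extract_keywords
print(build_stemmed_contains_query([("apache kafka", 0.01), ("queues", 0.1)]))
# → $apache $kafka,$queues
```

A comma between terms is an OR in CONTAINS, so any matching stemmed word contributes to the score.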

Combining and Using the Results

Once the documents have been obtained, they can be provided to an LLM as additional context for use in responding to a query or instruction. In some scenarios, it may be appropriate to simply include all of the retrieved documents in the prompt for the LLM. In others, the lack of related documents is important context in itself, so determining relevance may matter. This might be based upon a specific weight placed on each search type, or the relevance of each document could be evaluated independently using a reranking model.
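One simple, widely used way to weight the two result lists without a reranking model is Reciprocal Rank Fusion (RRF), which scores each document by its rank position in each list. The sketch below is illustrative only and not part of this tutorial's code; the function name and the conventional k=60 constant are assumptions:

```python
def reciprocal_rank_fusion(result_lists, k=60):
    """Fuses ranked lists of (id, text, metadata) tuples into a single list
    ordered by combined RRF score. k dampens the influence of top ranks."""
    scores = {}
    documents = {}
    for results in result_lists:
        for rank, doc in enumerate(results):
            doc_id = doc[0]
            # Each list contributes 1/(k + rank + 1) to the document's score
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
            documents[doc_id] = doc
    ranked_ids = sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)
    return [documents[doc_id] for doc_id in ranked_ids]

# Hypothetical results: 'b' appears in both lists, so it is ranked first.
vector_results = [("a", "doc a", {}), ("b", "doc b", {})]
keyword_results = [("b", "doc b", {}), ("c", "doc c", {})]
fused = reciprocal_rank_fusion([vector_results, keyword_results])
print([doc[0] for doc in fused])  # → ['b', 'a', 'c']
```

Documents appearing in both result lists accumulate score from each, so agreement between the two search types naturally rises to the top.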

In every use case, de-duplicating the results will be an important step. Each of these functions has preserved the document id, which provides a unique identifier that can be used for this purpose. For example:

def deduplicate_documents(*args):
    """
    Combines lists of documents returning a union of the lists, with duplicates removed (based upon an 'id' match)
    Arguments:
        Any number of arrays of documents as a tuple of 'id', 'text', 'metadata'
    @returns: Array<(string, string, dict)> Single array of documents containing no duplicates
    """
    #Definitely not the most efficient de-duplication, but in this case, lists are typically <10 items
    documents = []
    for document_list in args:
        for document in document_list:
            if document[0] not in map(lambda doc: doc[0], documents):
                documents.append(document)
    return documents

These two approaches allow relevant documents to be retrieved from the same database table, using two different mechanisms for determining similarity. Both execute very quickly, so beyond the text index there is minimal overhead in applying both techniques to retrieve relevant documents. Semantic search allows documents to be retrieved despite the use of synonyms or the occasional typo, while keyword search can capture scenarios in which the user is asking about a very specific topic, such as a product or feature name. The combined results can complement each other, adding robustness to the overall RAG process.

Acknowledgements

Other Learning Resources

Explore other labs on docs.oracle.com/learn or access more free learning content on the Oracle Learning YouTube channel. Additionally, visit education.oracle.com/learning-explorer to become an Oracle Learning Explorer.

For product documentation, visit Oracle Help Center.