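Implement Simple Hybrid Search for Retrieval Augmented Generation using Oracle Database 23ai

Introduction

This tutorial shows how to implement hybrid search as part of a retrieval augmented generation (RAG) process, using both vector search and keyword search in Oracle Database 23ai.

RAG has emerged as a key capability for enterprises that use large language models (LLMs) and want to augment responses with specific business or domain information. Searching an enterprise knowledge base for information relevant to a user's query, and then attaching the retrieved information to the request sent to the LLM, allows responses to draw on internal data, policies, and scenario-specific information. That specific information reduces the likelihood of hallucination and allows a natural language response that includes appropriate citations and references to documents in the knowledge base.

Oracle Database includes several powerful features for performing RAG tasks. The Oracle Database 23ai release introduced AI Vector Search, which allows fast semantic searches on unstructured data. Semantic search with high-quality vector embeddings can seem almost magical, with most queries surfacing highly relevant documents from a vast knowledge base. However, the fact that vector search is available and gives high-quality results in most scenarios does not mean that traditional keyword-based search should be abandoned. Developers who have spent a lot of time testing retrieval have certainly encountered some oddities, in which a document that covers the specific subject asked about, and that intuitively belongs in the response, is not retrieved, even though a keyword search would find it easily.

Why not use both?

Oracle Database 23ai builds on all of the powerful capabilities that have been added to the database over time, including Oracle Text, which provides rich text query capabilities. This tutorial shows how having both of these capabilities within the database makes it very simple to implement a robust hybrid search that provides the best of both worlds without any duplication of data.

Note: This tutorial demonstrates how to implement the logic for hybrid search in Python using a local embedding model. Oracle Database 23ai supports calculating text embeddings in the database using ONNX models, and there is native support for hybrid search using an ONNX model in the database through a database management system (DBMS) package. Implementing the logic directly in Python gives much finer control over the behavior of the search, whereas the DBMS package offers a simple yet powerful set of capabilities for some use cases. For more information, see Import ONNX Models into Oracle Database End-to-End Example and Understand Hybrid Search.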

Objectives

Set up the database table for hybrid search.
Implement a simple document ingestion process in Python.
Implement vector search and keyword search for documents.

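Prerequisites

Access to an Oracle Database 23ai database instance with at least one user that can create tables and indexes.
A Python runtime that can connect to the database.

Note: This tutorial uses Python to interact with Oracle Database, on the assumption that document hybrid search is being implemented as part of a broader RAG process. The key capabilities, however, are demonstrated using only SQL, so the approach should be applicable in other development languages.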

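Task 1: Set up the Database Table

The table used for hybrid search can be the same as a table used for vector search, because Oracle Text can index character large object (CLOB) fields, which are typically used to store document content.

Note: The SQL for the initial table setup is shown here directly rather than being invoked from Python. The database user account used by the RAG process should only have privileges to query the table, not to create tables and indexes. Such tasks would be performed by a database administrator using their preferred tools.

After connecting to the database, use the following SQL to create the table that stores the documents used by the RAG process.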
```
CREATE TABLE hybridsearch 
(id RAW(16) DEFAULT SYS_GUID() PRIMARY KEY,
text CLOB, 
embeddings VECTOR(768, FLOAT32),
metadata JSON);
```
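The size of the vector column depends on the embedding model used to generate the vectors for semantic search. Here, 768 is used, which corresponds to the embedding model used later in this example. If you use a different model, this value may need to be updated to match. A JSON column is specified for storing document metadata, because it offers a flexible structure while still allowing filtering on document attributes. It is not used in this tutorial, but it is included because any real-world scenario will require document metadata.

To enable keyword search on the text, create a text index on the text column.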
```
CREATE SEARCH INDEX rag_text_index ON hybridsearch (text);
```
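
Because the VECTOR(768, FLOAT32) column defined above is declared with a fixed number of dimensions, it can help to check the length of an embedding before attempting an insert. The following is a minimal sketch under that assumption. The names `check_embedding_dimension` and `EXPECTED_DIMENSION` are hypothetical and are not part of the tutorial's code.

```python
# Hypothetical helper, not part of the tutorial code: verify that an embedding
# has the dimension declared for the VECTOR column before inserting it.
EXPECTED_DIMENSION = 768  # assumed to match VECTOR(768, FLOAT32) in the table definition


def check_embedding_dimension(embedding, expected=EXPECTED_DIMENSION):
    """Return the embedding unchanged, or raise ValueError on a size mismatch."""
    actual = len(embedding)
    if actual != expected:
        raise ValueError(
            f"Embedding has {actual} dimensions, but the column expects {expected}"
        )
    return embedding


if __name__ == "__main__":
    check_embedding_dimension([0.0] * 768)  # passes silently
    try:
        check_embedding_dimension([0.0] * 384)
    except ValueError as err:
        print(err)
```

A check like this fails early in Python with a readable message, rather than surfacing as a database error during the insert.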

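Task 2: Install Libraries

This tutorial uses a Python runtime to implement document ingestion and hybrid search. It is recommended to configure the Python runtime using a venv or conda environment.

Note: This tutorial introduces sections of code as they are needed to demonstrate each concept. The code will require refactoring if it is incorporated into a broader solution.

Install the dependencies required for this tutorial using pip.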
```
$ pip install -U oracledb sentence-transformers git+https://github.com/LIAAD/yake
```

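Task 3: Ingest Documents into the Database

Once the table is created, documents can be inserted into it as rows. Typically, the ingestion process should be separate from the query process and use a different database account with different privileges, since the query process should not be able to modify the table. For the purposes of this tutorial, however, they are not separated.

In your Python environment, establish a connection to the database. For example, for Autonomous Transaction Processing: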

```
import os
import oracledb
import traceback
import json
import re

try:
    print(f'Attempting to connect to the database with user: [{os.environ["DB_USER"]}] and dsn: [{os.environ["DB_DSN"]}]')
    # The wallet parameters are used for wallet-based connections such as Autonomous Database
    connection = oracledb.connect(user=os.environ["DB_USER"], password=os.environ["DB_PASSWORD"], dsn=os.environ["DB_DSN"],
                                config_dir="/path/to/dbwallet",
                                wallet_location="/path/to/dbwallet",
                                wallet_password=os.environ["DB_WALLET_PASSWORD"])
    print("Connection successful!")
except Exception:
    print(traceback.format_exc())
    print("Connection failed!")
    # Re-raise so that later steps do not run without a connection
    raise
```

The python-oracledb documentation provides details on connecting to non-ADB instances, which may not use a connection wallet.
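
As an illustration of handling both cases, the following sketch builds the keyword arguments for `oracledb.connect()` from environment variables and adds the wallet parameters only when a wallet directory is configured. The function name `build_connect_kwargs` and the `DB_WALLET_DIR` variable are hypothetical. The parameter names themselves are the ones used in the connection example above.

```python
import os


def build_connect_kwargs(env=os.environ):
    """Build keyword arguments for oracledb.connect() from environment variables.

    DB_WALLET_DIR is a hypothetical variable name used only in this sketch.
    """
    kwargs = {
        "user": env["DB_USER"],
        "password": env["DB_PASSWORD"],
        "dsn": env["DB_DSN"],
    }
    wallet_dir = env.get("DB_WALLET_DIR")
    if wallet_dir:
        # Wallet-based connection, using the same parameters as the example above
        kwargs["config_dir"] = wallet_dir
        kwargs["wallet_location"] = wallet_dir
        kwargs["wallet_password"] = env["DB_WALLET_PASSWORD"]
    return kwargs


if __name__ == "__main__":
    sample_env = {"DB_USER": "rag_user", "DB_PASSWORD": "secret", "DB_DSN": "localhost/FREEPDB1"}
    print(sorted(build_connect_kwargs(sample_env)))
    # With python-oracledb installed, the result could be used as:
    # connection = oracledb.connect(**build_connect_kwargs())
```

Keeping the wallet settings optional means the same code path can be tried against a local database and a wallet-protected instance, subject to the connection options your environment actually requires.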

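Initialize the embedding model that will be used to calculate embedding vectors. Here, the all-mpnet-base-v2 model is used, which is available under the Apache license. This specific embedding model is used only for illustration. Other models may perform better or worse depending on your data. This example uses the SentenceTransformers interface for simplicity. For more information, see the SentenceTransformers documentation.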
```
from sentence_transformers import SentenceTransformer
   
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
```

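Implement a simple document ingestion function. The process of obtaining, parsing, and chunking documents is out of scope for this tutorial, and it is assumed that documents are simply provided as strings. This function calculates the embeddings using the provided model and inserts the document and its embeddings into the table created earlier.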

```
def add_document_to_table(connection, table_name, model, document, **kwargs):
    """
    Adds a document to the database for use in RAG.
    @param connection An established database connection.
    @param table_name The name of the table to add the document to
    @param model A sentence transformers model, with an 'encode' function that returns embeddings
    @param document The document to add, as a string
    Keyword Arguments:
    metadata: A dict with metadata about the document which is stored as a JSON object
    """
    #Calculate the embeddings for the document
    embeddings = model.encode(document)
    #The table name is interpolated into the SQL, so it must come from trusted code, not user input
    insert_sql = f"""INSERT INTO {table_name} (text, embeddings, metadata) VALUES (:text, :embeddings, :metadata)"""
    metadata = kwargs.get('metadata', {})
    cursor = connection.cursor()
    try:
        #The embeddings and metadata are passed as JSON-formatted strings
        cursor.execute(insert_sql, text=document, embeddings=json.dumps(embeddings.tolist()), metadata=json.dumps(metadata))
    except Exception:
        print(traceback.format_exc())
        print("Insert failed!")
```
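
The function above passes the embedding to the database as a JSON-formatted string. The python-oracledb documentation also describes binding Python `array.array` values to VECTOR columns in recent driver releases. The sketch below shows only the conversion step, and assumes a driver version with that support (confirm against the driver documentation for your version). The helper name `to_float32_vector` is hypothetical.

```python
import array


def to_float32_vector(values):
    """Convert an iterable of numbers into a float32 array.array.

    Type code 'f' is a single-precision float, chosen here to correspond to
    the FLOAT32 format declared for the VECTOR column.
    """
    return array.array("f", (float(v) for v in values))


if __name__ == "__main__":
    vec = to_float32_vector([0.25, -1.5, 3])
    print(vec.typecode, len(vec))
    # Possible use as a bind value (assumes a driver version with VECTOR support):
    # cursor.execute(insert_sql, text=document,
    #                embeddings=to_float32_vector(embeddings),
    #                metadata=json.dumps(metadata))
```

Whether this is preferable to the JSON string depends on the driver version in use, so the string-based approach in the tutorial remains the baseline.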

Add some sample documents to the database for testing.

Note: An explicit commit() is invoked, which triggers the update of the text index.

```
# Must match the name of the table created in Task 1
table_name = "hybridsearch"

# These samples are just included to provide some data to search, not to 
# demonstrate the efficacy of key phrase versus semantic search. The benefits
# of hybrid search typically start to emerge when using a much larger volume
# of content.
document_samples = [
    "Oracle Database 23ai is the next long-term support release of Oracle Database. It includes over 300 new features with a focus on artificial intelligence (AI) and developer productivity.",
    "Features such as AI Vector Search enable you to leverage a new generation of AI models to generate and store vectors of documents, images, sound, and so on;  index them and quickly look for similarity while leveraging the existing analytical capabilities of Oracle Database.",
    "New developer-focused features now make it simpler to build next-generation applications that use JSON or relational development approaches or both interchangeably.",
    "With a built-in VECTOR data type, you can run AI-powered vector similarity searches within the database instead of having to move business data to a separate vector database.",
    "Property graphs provide an intuitive way to find direct or indirect dependencies in data elements and extract insights from these relationships. The enterprise-grade manageability, security features, and performance features of Oracle Database are extended to property graphs.",
    "The ISO SQL standard has been extended to include comprehensive support for property graph queries and creating property graphs in SQL.",
    "Transactional Event Queues (TxEventQ) are queues built into the Oracle Database. TxEventQ are a high performance partitioned implementation with multiple event streams per queue.",
    "Transactional Event Queues (TxEventQ) now support the KafkaProducer and KafkaConsumer classes from Apache Kafka. Oracle Database can now be used as a source or target for applications using the Kafka APIs.",
    "Database metrics are stored in Prometheus, a time-series database and metrics tailored for developers are displayed using Grafana dashboards. A database metrics exporter aids the metrics exports from database views into Prometheus time series database.",
    "The Java Database Connectivity (JDBC) API is the industry standard for database-independent connectivity between the Java programming language and a wide range of databases—SQL databases and other tabular data sources, such as spreadsheets or flat files.",
    "Java Database Connectivity (JDBC) is a Java standard that provides the interface for connecting from Java to relational databases. The JDBC standard is defined and implemented through the standard java.sql interfaces. This enables individual providers to implement and extend the standard with their own JDBC drivers.",
    "The JDBC Thin driver enables a direct connection to the database by providing an implementation of Oracle Net Services on top of Java sockets. The driver supports the TCP/IP protocol and requires a TNS listener on the TCP/IP sockets on the database server.",
    "The JDBC Thin driver is a pure Java, Type IV driver that can be used in applications. It is platform-independent and does not require any additional Oracle software on the client-side. The JDBC Thin driver communicates with the server using Oracle Net Services to access Oracle Database.",
    "The JDBC OCI driver, written in a combination of Java and C, converts JDBC invocations to calls to OCI, using native methods to call C-entry points. These calls communicate with the database using Oracle Net Services.",
    "The python-oracledb driver is a Python extension module that enables access to Oracle Database. By default, python-oracledb allows connecting directly to Oracle Database 12.1 or later. This Thin mode does not need Oracle Client libraries.",
    "Users interact with a Python application, for example by making web requests. The application program makes calls to python-oracledb functions. The connection from python-oracledb Thin mode to the Oracle Database is established directly.",
    "Python-oracledb is said to be in ‘Thick’ mode when it links with Oracle Client libraries. Depending on the version of the Oracle Client libraries, this mode of python-oracledb can connect to Oracle Database 9.2 or later.",
    "To use python-oracledb Thick mode, the Oracle Client libraries must be installed separately. The libraries can be from an installation of Oracle Instant Client, from a full Oracle Client installation (such as installed by Oracle’s GUI installer), or even from an Oracle Database installation (if Python is running on the same machine as the database).",
    "Oracle’s standard client-server version interoperability allows connection to both older and newer databases from different Oracle Client library versions."
]

for document in document_samples:
    add_document_to_table(connection, table_name, model, document)

#Call an explicit commit after adding the documents, which will trigger an async update of the text index
connection.commit()
```

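Task 4: Implement Oracle Database 23ai AI Vector Search

Once the documents are loaded, Oracle Database 23ai AI Vector Search capabilities can be used to perform a semantic search based on a vector derived from a query.

Implement a helper function for working with the CLOB objects returned by the database.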

```
def get_clob(result):
    """
    Utility function for getting the value of a LOB result from the DB.
    @param result Raw value from the database
    @returns string
    """
    clob_value = ""
    if result:
        if isinstance(result, oracledb.LOB):
            raw_data = result.read()
            #Decode binary LOB content, otherwise use the string as returned
            if isinstance(raw_data, bytes):
                clob_value = raw_data.decode("utf-8")
            else:
                clob_value = raw_data
        elif isinstance(result, str):
            clob_value = result
        else:
            raise TypeError(f"Unexpected type: {type(result)}")
    return clob_value
```

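Implement a function to perform a semantic search using the vector_distance() SQL function. The all-mpnet-base-v2 model used in this tutorial uses COSINE similarity, which is the default here. If you use a different model, you may need to specify a different distance strategy.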

```
def retrieve_documents_by_vector_similarity(connection, table_name, model, query, num_results, **kwargs):
    """
    Retrieves the most similar documents from the database based upon semantic similarity.
    @param connection An established database connection.
    @param table_name The name of the table to query
    @param model A sentence transformers model, with an 'encode' function that returns embeddings
    @param query The string to search for semantic similarity with
    @param num_results The number of results to return
    Keyword Arguments:
    distance_strategy: The distance strategy to use for comparison. One of: 'EUCLIDEAN', 'DOT', 'COSINE' - Default: COSINE
    @returns: Array<(string, string, dict)> Array of documents as a tuple of 'id', 'text', 'metadata'
    """
    # In many cases, building up the search SQL may involve adding a WHERE 
    # clause in order to search only a subset of documents, though this is
    # omitted for this simple example.
    search_sql = f"""SELECT id, text, metadata,
                    vector_distance(embeddings, :embedding, {kwargs.get('distance_strategy', 'COSINE')}) as distance
                    FROM {table_name}
                    ORDER BY distance
                    FETCH APPROX FIRST {num_results} ROWS ONLY
                    """
    query_embedding = model.encode(query)
    cursor = connection.cursor()
    try:
        cursor.execute(search_sql, embedding=json.dumps(query_embedding.tolist()))
        rows = cursor.fetchall()
    except Exception:
        print(traceback.format_exc())
        print("Retrieval failed!")
        # Return no documents rather than fetching from a failed cursor
        return []
    documents = []
    for row in rows:
        # The id is a RAW(16) value, returned as bytes, so convert it to a hex string
        documents.append((row[0].hex(), get_clob(row[1]), row[2]))

    return documents
```
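
To make the ordering concrete, the following sketch computes cosine distance in plain Python, taken here as one minus the cosine similarity of the two vectors. It is for illustration only. In the tutorial, the database performs this calculation through vector_distance(), and the function name `cosine_distance` and the toy vectors are assumptions of this sketch.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("Cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


if __name__ == "__main__":
    query = [1.0, 0.0]
    candidates = {
        "same direction": [2.0, 0.0],
        "orthogonal": [0.0, 3.0],
        "opposite": [-1.0, 0.0],
    }
    # Smaller distances rank first, mirroring ORDER BY distance in the SQL above
    ranked = sorted(candidates.items(), key=lambda item: cosine_distance(query, item[1]))
    for name, vector in ranked:
        print(name, round(cosine_distance(query, vector), 6))
```

Because only the direction of the vectors matters for this measure, a vector and a scaled copy of it have a distance of zero.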

Validate the semantic search capability using the following sample.

```
query = "I am writing a python application and want to use Apache Kafka for interacting with queues, is this supported by the Oracle database?"

documents_from_vector_search = retrieve_documents_by_vector_similarity(connection, table_name, model, query, 4)
print(documents_from_vector_search)
```
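
The comment in the retrieval function above mentions adding a WHERE clause to search a subset of documents, and the function places the distance strategy directly into the SQL text. The sketch below shows one possible way to handle both: it restricts the distance strategy to the three values listed in the docstring and builds an optional filter on the JSON metadata column using JSON_VALUE with bind variables. The function name `build_vector_search_sql`, the `metadata_filters` parameter, and the example key `source` are hypothetical.

```python
import re

ALLOWED_DISTANCES = {"EUCLIDEAN", "DOT", "COSINE"}
IDENTIFIER = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")


def build_vector_search_sql(table_name, num_results, distance_strategy="COSINE", metadata_filters=None):
    """Return (sql, binds) for a vector search, optionally filtered on metadata.

    metadata_filters maps top-level JSON keys to required string values.
    The query embedding still has to be added to binds as 'embedding'.
    """
    distance = distance_strategy.upper()
    if distance not in ALLOWED_DISTANCES:
        raise ValueError(f"Unsupported distance strategy: {distance_strategy}")
    if not IDENTIFIER.match(table_name):
        raise ValueError(f"Unexpected table name: {table_name}")
    binds = {}
    conditions = []
    for index, (key, value) in enumerate((metadata_filters or {}).items()):
        # Keys become part of the SQL text, so only simple identifiers are accepted
        if not IDENTIFIER.match(key):
            raise ValueError(f"Unexpected metadata key: {key}")
        bind_name = f"meta_{index}"
        conditions.append(f"JSON_VALUE(metadata, '$.{key}') = :{bind_name}")
        binds[bind_name] = value
    where_clause = f"WHERE {' AND '.join(conditions)}" if conditions else ""
    sql = f"""SELECT id, text, metadata,
       vector_distance(embeddings, :embedding, {distance}) AS distance
FROM {table_name}
{where_clause}
ORDER BY distance
FETCH APPROX FIRST {int(num_results)} ROWS ONLY"""
    return sql, binds


if __name__ == "__main__":
    sql, binds = build_vector_search_sql("hybridsearch", 4, metadata_filters={"source": "docs"})
    print(sql)
    print(binds)
```

Values are passed as bind variables, while anything that has to appear in the SQL text itself is validated first. That separation is the main design point of the sketch.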

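Task 5: Implement Keyword Search

This tutorial uses Oracle Text, which provides powerful text querying tooling within Oracle Database. Although Oracle Text offers a wide range of capabilities, for the purposes of hybrid search all that is required is a simple search by keywords or key phrases. There are a number of techniques for keyword extraction and search. This implementation is intended to be as simple as possible, using Yet Another Keyword Extractor (YAKE), which relies on language features to perform unsupervised extraction of keywords and key phrases.

There are a variety of other approaches to keyword search, the Okapi BM25 algorithm being a popular one. However, using unsupervised keyword extraction with a powerful text search index, such as that provided by Oracle Text, has the advantage of being particularly simple, and robustness comes from combining it with semantic search.

Implement a function for key phrase extraction.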

```
import yake

def extract_keywords(query, num_results):
    """
    Utility function for extracting keywords from a string.
    @param query The string from which keywords should be extracted
    @param num_results The number of keywords/phrases to return
    @returns Array<(string, number)> Array of keywords/phrases as a tuple of 'keyword', 'score' (lower scores are more significant)
    """
    language = "en"
    #Max number of words to include in a key phrase
    max_ngram_size = 2
    window_size = 1

    kw_extractor = yake.KeywordExtractor(lan=language, n=max_ngram_size, windowsSize=window_size, top=num_results, features=None)
    keywords = kw_extractor.extract_keywords(query.strip().lower())
    #Sort by score so that the most significant keywords come first
    return sorted(keywords, key=lambda kw: kw[1])
```

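Because this method of keyword extraction relies on language features, setting the appropriate language is important, and performance may vary depending on the language itself.

Validate keyword extraction using the following sample.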

```
query = "I am writing a python application and want to use Apache Kafka for interacting with queues, is this supported by the Oracle database?"

keywords = extract_keywords(query, 4)
print(keywords)
```

Implement a function to perform a keyword-based search.

```
def retrieve_documents_by_keywords(connection, table_name, query, num_results):
    """
    Retrieves the documents from the database which have the highest density of matching keywords as the query
    @param connection An established database connection.
    @param table_name The name of the table to query
    @param query The string from which to extract keywords/phrases for searching
    @param num_results The number of results to return
    @returns: Array<(string, string, dict)> Array of documents as a tuple of 'id', 'text', 'metadata'
    """
    num_keywords = 4
    keywords = extract_keywords(query, num_keywords)
    #A plain FETCH FIRST is used here, since the APPROX keyword relates to vector index searches
    search_sql = f"""SELECT id, text, metadata, SCORE(1)
                    FROM {table_name}
                    WHERE CONTAINS (text, :query_keywords, 1) > 0
                    ORDER BY SCORE(1) DESC
                    FETCH FIRST {num_results} ROWS ONLY
                    """
    #Assemble the keyword search query, adding the stemming operator to each word
    stemmed_keywords = []
    splitter = re.compile('[^a-zA-Z0-9_\\+\\-/]')
    for keyword in keywords:
        #Skip empty fragments so that a bare "$" is never added to the query
        words = [single_word for single_word in splitter.split(keyword[0]) if single_word]
        if words:
            stemmed_keywords.append(" ".join("$" + single_word for single_word in words))
    cursor = connection.cursor()
    try:
        cursor.execute(search_sql, query_keywords=",".join(stemmed_keywords))
        rows = cursor.fetchall()
    except Exception:
        print(traceback.format_exc())
        print("Retrieval failed!")
        #Return no documents rather than fetching from a failed cursor
        return []
    documents = []
    for row in rows:
        documents.append((row[0].hex(), get_clob(row[1]), row[2]))
    return documents
```

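One of the simplest behaviors in Oracle Text is performing keyword searches through the CONTAINS function, which supports a wide range of additional operators to refine or broaden the search. This tutorial uses the stemming operator, which expands a query to include all terms that share the same stem or root word as the specified term. It is used to normalize words regardless of plurality and tense, so that, for example, cat matches cats. For more information, see Oracle Text CONTAINS Query Operators.

Note: If applying this to a large corpus of documents, it is recommended to configure the text index to include word stems to improve performance. For more information about the basic lexer, see BASIC_LEXER.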
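
The way the query string passed to CONTAINS is assembled can be sketched on its own, without a database connection. The following stand-alone example prefixes each word of each key phrase with the stemming operator ($) and joins the phrases with commas, following the same splitting rule as the retrieval function. The function name `build_contains_query` is hypothetical, and the sketch does not attempt to escape Oracle Text reserved words.

```python
import re

# Same splitting rule as the tutorial's retrieval function
_SPLITTER = re.compile(r"[^a-zA-Z0-9_+\-/]")


def build_contains_query(key_phrases):
    """Prefix each word with the stemming operator ($) and join phrases with commas."""
    stemmed_phrases = []
    for phrase in key_phrases:
        # Drop empty fragments produced by consecutive separators
        words = [word for word in _SPLITTER.split(phrase) if word]
        if words:
            stemmed_phrases.append(" ".join("$" + word for word in words))
    return ",".join(stemmed_phrases)


if __name__ == "__main__":
    print(build_contains_query(["apache kafka", "python application", "queues"]))
```

Printing the assembled string for a few sample queries is a quick way to check what Oracle Text is actually being asked to match before running the search.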

Validate the keyword-based search using the following sample.

```
query = "I am writing a python application and want to use Apache Kafka for interacting with queues, is this supported by the Oracle database?"

documents_from_keyphrase_search = retrieve_documents_by_keywords(connection, table_name, query, 4)
print(documents_from_keyphrase_search)
```

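Combine and Use the Results

Once the documents have been retrieved, they can be provided to an LLM as additional context that it can use to respond to queries or instructions. In some scenarios, it may be appropriate to simply include all of the retrieved documents in the prompt to the LLM. In other cases, the lack of relevant documents is an important piece of context in itself, so it may be important to determine their relevance. This could be based on a particular weighting assigned to each type of search, or the relevance of each document could be assessed independently of how it was returned, using a reranking model.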
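
One possible way to apply a weighting per search type is weighted reciprocal rank fusion (RRF), sketched below. RRF is not part of this tutorial's approach and is swapped in here purely as an illustration. The function name `weighted_rank_fusion`, the weights, and the constant `k=60` are assumptions that would need tuning for real data.

```python
def weighted_rank_fusion(result_lists, weights, k=60):
    """Fuse ranked document lists into one list ordered by fused score.

    result_lists: lists of (id, text, metadata) tuples, best match first.
    weights: one weight per list. k dampens the influence of top ranks.
    """
    if len(result_lists) != len(weights):
        raise ValueError("Provide exactly one weight per result list")
    scores = {}
    documents = {}
    for results, weight in zip(result_lists, weights):
        for rank, document in enumerate(results, start=1):
            doc_id = document[0]
            # Each appearance adds weight / (k + rank) to the document's score
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
            documents.setdefault(doc_id, document)
    ranked_ids = sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)
    return [documents[doc_id] for doc_id in ranked_ids]


if __name__ == "__main__":
    vector_results = [("a", "Doc A", {}), ("b", "Doc B", {})]
    keyword_results = [("b", "Doc B", {}), ("c", "Doc C", {})]
    fused = weighted_rank_fusion([vector_results, keyword_results], weights=[1.0, 1.0])
    print([doc[0] for doc in fused])
```

A document returned by both searches accumulates score from each list, so it tends to rise above documents found by only one of them. Each id also appears only once in the fused list.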

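In each of these use cases, de-duplicating the results is an important step. Each function retains the document ID, which provides a unique identifier that can be used for this purpose. For example: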

```
def deduplicate_documents(*args):
    """
    Combines lists of documents returning a union of the lists, with duplicates removed (based upon an 'id' match)
    Arguments:
        Any number of arrays of documents as a tuple of 'id', 'text', 'metadata'
    @returns: Array<(string, string, dict)> Single array of documents containing no duplicates
    """
    #Track the ids already seen so that each document is kept once, in first-seen order
    documents = []
    seen_ids = set()
    for document_list in args:
        for document in document_list:
            if document[0] not in seen_ids:
                seen_ids.add(document[0])
                documents.append(document)
    return documents
```
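
Once the lists are combined, the documents are typically formatted into the prompt sent to the LLM. The sketch below is one hypothetical way to do that: it labels each document with its id so that a response can cite its sources, and states explicitly when nothing relevant was found. The function name `build_rag_prompt`, the `max_documents` parameter, and the prompt wording are all assumptions.

```python
def build_rag_prompt(query, documents, max_documents=5):
    """Format retrieved (id, text, metadata) tuples into a prompt string."""
    if not documents:
        # The absence of relevant documents is itself useful context
        context = "No relevant documents were found."
    else:
        context = "\n\n".join(
            f"[{doc_id}] {text}" for doc_id, text, _metadata in documents[:max_documents]
        )
    return (
        "Answer the question using only the context below, "
        "citing document ids in square brackets.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )


if __name__ == "__main__":
    sample_documents = [
        ("0a1b", "Transactional Event Queues now support the Kafka APIs.", {}),
        ("2c3d", "The python-oracledb driver enables access to Oracle Database.", {}),
    ]
    print(build_rag_prompt("Can I use Kafka APIs from Python?", sample_documents))
```

Capping the number of documents keeps the prompt within a predictable size. The right limit depends on the model's context window and the length of the stored chunks.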

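These two methods allow relevant documents to be extracted from the same database table, using two different mechanisms for determining similarity. Both methods are very fast to execute, so beyond the text index itself there is minimal overhead in applying both techniques to retrieve relevant documents. Semantic search allows documents to be retrieved regardless of the use of synonyms or the occasional typo, while key phrase search can capture scenarios in which the user is asking very specifically about a particular subject, such as a product or function name. The combined results can complement each other to add robustness to the overall RAG process.

Acknowledgments

Author - Callan Howell-Pavia (APAC Solutions Specialist)

More Learning Resources

Explore other labs on docs.oracle.com/learn or access more free learning content on the Oracle Learning YouTube channel. Additionally, visit education.oracle.com/learning-explorer to become an Oracle Learning Explorer.

For product documentation, visit Oracle Help Center.

Title and Copyright Information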

Implement Simple Hybrid Search for Retrieval Augmented Generation using Oracle Database 23ai

G19804-01

November 2024

Oracle and/or its affiliates.