ファイルからテキスト、チャンク、Oracle Database内の埋込みへの変換

最初にPDFファイルをテキストに変換して、テキストをチャンクに分割し、データベースに格納されているベクトル埋込みモデルにアクセスして各チャンクのベクトル埋込みを作成します。

各ステージで結果を確認、検査および修正し、さらに続行するスタンドアロン・アプリケーションでは、このようなパラレルまたはステップバイステップの変換を実行できます。

ここでは、DBMS_VECTOR_CHAINパッケージの一連の関数(UTL_TO_TEXT、UTL_TO_CHUNKS、UTL_TO_EMBEDDINGSなど)を使用します。

ステップバイステップの変換チェーンを使用し、データベースに格納されているベクトル埋込みモデルを使用して埋込みを生成するには:

ローカル・ユーザーとして接続し、データ・ダンプ・ディレクトリを準備します。

SQL*PlusにSYSユーザーとしてログインし、SYSDBAとして接続します。

conn sys/password as sysdba

CREATE TABLESPACE tbs1
DATAFILE 'tbs5.dbf' SIZE 20G AUTOEXTEND ON
EXTENT MANAGEMENT LOCAL
SEGMENT SPACE MANAGEMENT AUTO;

SET ECHO ON
SET FEEDBACK 1
SET NUMWIDTH 10
SET LINESIZE 80
SET TRIMSPOOL ON
SET TAB OFF
SET PAGESIZE 10000
SET LONG 10000

ローカル・ユーザー(docuser)を作成し、必要な権限を付与します。

drop user docuser cascade;

create user docuser identified by docuser DEFAULT TABLESPACE tbs1 quota unlimited on tbs1;

grant DB_DEVELOPER_ROLE to docuser;

入力データおよびモデル・ファイルを格納するローカル・ディレクトリ(VEC_DUMP)を作成します。必要な権限を付与します。
```
create or replace directory VEC_DUMP as '/my_local_dir/';
```
```
grant read, write on directory VEC_DUMP to docuser;

commit;
```
ローカル・ユーザー(docuser)として接続します:
```
conn docuser/password
```

ファイルをテキストに変換します

リレーショナル表(documentation_tab)を作成し、PDFドキュメント(『Oracle Database概要』)をその中に格納します。

drop table documentation_tab purge;

CREATE TABLE documentation_tab (id number, data blob);

INSERT INTO documentation_tab values(1, to_blob(bfilename('VEC_DUMP', 'database-concepts23ai.pdf')));

commit;

SELECT dbms_lob.getlength(t.data) from documentation_tab t;

UTL_TO_TEXTをコールして、PDFドキュメントをテキスト形式に変換します。
```
SELECT dbms_vector_chain.utl_to_text(dt.data) from documentation_tab dt;
```

出力の抜粋を次に示します。

DBMS_VECTOR_CHAIN.UTL_TO_TEXT(DT.DATA)
--------------------------------------------------------------------------
Database Concepts
23ai
Oracle Database Database Concepts, 23ai

This software and related documentation are provided under a license agreement containing restrictions on 
use and disclosure and are protected by intellectual property laws. Except as expressly permitted in your
license agreement or allowed by law, you may not use, copy, reproduce, translate
, broadcast, modify, license, transmit, distribute, exhibit, perform, publish, or display any part, in any for
m, or by any means. Reverse engineering, disassembly, or decompilation of this software, unless required by
law for interoperability, is prohibited.

Contents

Preface
Audience
xxiii
Documentation Accessibility
xxiii
Related Documentation
xxiv
Conventions
xxiv
1
Introduction to Oracle Database
About Relational Databases
1-1
Database Management System (DBMS)
1-2
Relational Model
1-2
Relational Database Management System (RDBMS)
1-3
Brief History of Oracle Database
1-3
Schema Objects
1-5
Tables
1-5
Indexes
1-6
Data Access

1 row selected.

テキストをチャンクに変換します。

UTL_TO_CHUNKSをコールして、テキスト・ドキュメントをチャンク化します。

ここでは、デフォルトのチャンク・パラメータを使用してチャンクを作成します。

SELECT ct.* 
  from documentation_tab dt, 
  dbms_vector_chain.utl_to_chunks(dbms_vector_chain.utl_to_text(dt.data)) ct;

出力の抜粋を次に示します。

{"chunk_id":1,"chunk_offset":1508024,"chunk_length":579,"chunk_data":"Inventory 
\n\n\n\nAnalysis \n\n\n\nReporting \n\n\n\nMining\n\n\n\nSummary \n\n\n\nData
 \n\n\n\nRaw Data\n\n\n\nMetadata\n\n\n\nSee Also:\n\n\n\nOracle Database Data 
Warehousing Guide to learn about transformation \n\n\n\nmechanisms\n\n\n\nOvervie
w of Extraction, Transformation, and Loading (ETL) \n\n\n\nThe process of extrac
ting data from source systems and bringing it into the warehouse is \n\n\n\ncomm
only called ETL: extraction, transformation, and loading. ETL refers to a broad
process \n\n\n\nrather than three well-defined steps.\n\n\n\nIn a typical scenar
io, data from one or more operational systems is extracted and then"}

{"chunk_id":2,"chunk_offset":1508603,"chunk_length":607,"chunk_data":"physica
lly transported to the target system or an intermediate system for processing. \
n\n\n\nDepending on the method of transportation, some transformations can occur
 during this \n\n\n\nprocess. For example, a SQL statement that directly accesse
s a remote target through a \n\n\n\ngateway can concatenate two columns as part
of the \n\n\n\nSELECT\n\n\n\nstatement. \n\n\n\nOracle Database is not itself an
 ETL tool. However, Oracle Database provides a rich set of \n\n\n\ncapabilities
usable by ETL tools and customized ETL solutions. ETL capabilities provided by \
n\n\n\nOracle Database include:\n\n\n\n? \n\n\n\nTransportable tablespaces"}

3728 rows selected.

チャンク出力に余分なスペースと改行文字(\n\n)が含まれていることに注意します。これは、正規化がデフォルトで適用されないためです。次のステップに示すように、normalizeなどのカスタム・チャンク化指定を適用して、出力から重複する文字、余分なスペースまたは改行文字を省略できます。分割条件や最大サイズ制限など、他のチャンク仕様を適用することで、チャンクをさらに絞り込むことができます。

正規化および一部のカスタム・チャンク化指定を使用するには、次のJSONパラメータを適用します(「チャンク化の手法と例」を参照)。

SELECT ct.* 
  from documentation_tab dt, 
  dbms_vector_chain.utl_to_chunks(dbms_vector_chain.utl_to_text(
    dt.data),   
    JSON('{
           "by" : "words",
           "max" : "100",
           "overlap" : "0",
           "split" : "recursively",
           "language" : "american",
           "normalize" : "all"
          }')) ct;

出力は、次のように表示されます。

{"chunk_id":2536,"chunk_offset":1372527,"chunk_length":633,"chunk_data":"The dat
abase maps granules to parallel execution servers at execution time. When a para
llel execution server finishes reading the rows corresponding to a granule, and 
when granules remain, it obtains another granule from the query coordinator. This
operation continues until the table has been read. The execution servers send 
results back to the coordinator, which assembles the pieces into the desired full
table scan. Oracle Database VLDB and Partitioning Guide to learn how to use 
parallel execution. Oracle Database Data Warehousing Guide to learn about 
recommended"}

{"chunk_id":2537,"chunk_offset":1373160,"chunk_length":701,"chunk_data":"initial
ization parameters for parallelism\n\nChapter 18\n\nOverview of Background Proce
sses\n\nApplication and Oracle Net Services Architecture\n\nThis chapter defines
application architecture and describes how an Oracle database and database appli
cations work in a distributed processing environment. This material applies to 
almost every type of Oracle Database environment. Overview of Oracle Application
Architecture In the context of this chapter, application architecture refers to 
the computing environment in which a database application connects to an Oracle
database."}

3728 rows selected.

チャンク化の結果には、次のものが含まれます。

chunk_id: 各チャンクのチャンクID。
chunk_offset: ソース・ドキュメント内の各チャンクの元の位置(ドキュメントの開始位置(1の位置)を基準)
chunk_length: 各チャンクの文字長
chunk_data: 各チャンクのテキスト・ピース

チャンクを埋込みに変換します。

load_onnx_modelプロシージャをコールして、埋込みモデルをOracle Databaseにロードします。
```
EXECUTE dbms_vector.drop_onnx_model(model_name => 'doc_model', force => true);
```
```
EXECUTE dbms_vector.load_onnx_model('VEC_DUMP', 'my_embedding_model.onnx', 'doc_model', JSON('{"function" : "embedding", "embeddingOutput" : "embedding" , "input": {"input": ["DATA"]}}'));
```
この例では、プロシージャによって、my_embedding_model.onnxという名前のONNXモデル・ファイルがVEC_DUMPディレクトリからデータベースにdoc_modelとしてロードされます。my_embedding_model.onnxを埋込みモデルのONNXエクスポート、およびdoc_modelをインポートされたモデルがデータベースに格納される名前に置き換える必要があります。

ノート:
ONNX形式の埋込みモデルがない場合は、ONNXパイプライン・モデル: テキスト埋込みにリストされているステップを実行します。

UTL_TO_EMBEDDINGSをコールして、チャンクに対応する一連のベクトル埋込みを生成します。

var embed_params clob;
exec :embed_params := '{"provider":"database", "model":"doc_model"}';

SELECT et.* from 
  documentation_tab dt,
  dbms_vector_chain.utl_to_embeddings(
    dbms_vector_chain.utl_to_chunks(dbms_vector_chain.utl_to_text(dt.data)),
    json(:embed_params)) et;

出力の抜粋を次に示します。

{"embed_id":"1","embed_data":"Introduction to Oracle Database\n\n\nThis
chapter provides an overview of Oracle Database. Every organization has informat
ion that it must store and manage to meet its requirements. For example, a corpo
ration must collect and maintain human resources records for its employees. About
Relational Databases and Schema Objects\n\n\nData Access tables with parent keys,
base, upgrade, UROWID data type, user global area (UGA), user program interface,
","embed_vector":"[0.111119926,0.0423980951,-0.00929224491,-0.0352411047,-0.0144
591287,0.0277361721,0.183199733,-0.0245029964,-0.137614027,0.0730137378,0.017934
6036,0.0788726509,0.0176453814,0.100403085,-0.0518687107,-0.0152645027,0.0283792
187,-0.114087239,0.0139923804,0.0747490972,-0.181839675,-0.130034953,0.101207718
,0.117135495,-0.0682030097,-0.217743069,0.0613380745,0.0150767341,0.0361393057,-
0.113082513,-0.0550440662,-0.000983044971,-0.00719357422,0.1590323,-0.0220414512
,-0.0723528489,-0.0126240514,-0.175765082,0.168952227,0.0466451086,-0.12136507,-
0.0442310236,0.0139067639,0.054659389,-0.29653421,-0.0988782048,0.0794349238,-0.
0758788213,0.0152856084,-0.0260562375,0.0652966872,-0.0782724097,-0.0226081386,0
.0909011662,-0.184569761,0.159565002,-0.15350005,-0.0108382348,0.101788878,1.919
59683E-002,2.54665539E-002,2.50248201E-002,5.29858321E-002,1.42359538E-002,5.655
82886E-002,3.41602638E-002,3.18607911E-002,-3.07250433E-002,-3.60006578E-002,-3.
26940455E-002,-5.13980947E-002,-9.18597169E-003,-2.40122043E-002,2.15246622E-002
,-3.89301814E-002,1.09825116E-002,-8.59739035E-002,-3.34327705E-002,-6.52310252E
-002,2.46418975E-002,6.27725571E-003,6.54156879E-002,-2.97986511E-003,-1.485541E
-003,-9.00155635E-003]"}

{"embed_id":2,"embed_data":"1-18 Database, 19-6 write-ahead,18-17 Learn session 
memory in a large pool. The multitenant architecture enables an Oracle database 
to function as a multitenant container database (CDB). XStream,20-38\nZero Data 
Loss Recovery Appliance See Recovery Appliance zone maps, An application contai
ner consists of exactly one application root, and PDBs plugged in to this root.
Index-21\n D:20231208125114-08'00' D:20231208125114-08'00' D:20231208125114-08'0
0' 19-6","embed_vector":"[6.30229115E-002,6.80943206E-002,-6.94655553E-002,-2.58
157589E-002,-1.89648587E-002,-9.02655348E-002,1.97774544E-002,-9.39233322E-003,-
5.06882742E-002,2.0078931E-002,-1.28898735E-003,-4.10606936E-002,2.09831214E-003
,-4.53372523E-002,-7.09890276E-002,5.38906306E-002,-5.81014939E-002,-1.3959175E-
004,-1.08725717E-002,-3.79145369E-002,-4.39973129E-003,3.25602405E-002,6.5887302
2E-002,-4.27212603E-002,-3.00925318E-002,3.07144262E-002,-2.26370787E-004,-4.623
15865E-002,1.11807801E-001,7.36674219E-002,-1.61244173E-003,7.35205635E-002,4.16
726843E-002,-5.08309156E-002,-3.55720241E-003,4.49763797E-003,5.03803678E-002,2.
32542045E-002,-2.58533042E-002,9.06257033E-002,8.49585086E-002,8.65213498E-002,5
.84013164E-002,-7.72946924E-002,6.65430725E-002,-1.64568517E-002,3.23978886E-002
,2.24988302E-003,3.02566844E-003,-2.43405364E-002,9.75424573E-002,4.14630538E-00
3,1.89351812E-002,-1.10467218E-001,-1.24333188E-001,-2.36738548E-002,7.54277706E
-002,-1.64660662E-002,-1.38906585E-002,3.42438952E-003,-1.88432514E-005,-2.47511
379E-002,-3.42802797E-003,3.23110656E-003,4.24311385E-002,6.59448802E-002,-3.311
67318E-002,-5.14010936E-002,2.38897409E-002,-9.00154635E-002]"}

3728 rows selected.

埋込み結果には次のものが含まれます。

embed_id: 各ベクトルの埋込みのID番号
embed_data: 埋込みに変換される入力テキスト
embed_vector: 生成されたベクトル表現

関連トピック

親トピック: チャンク化と埋込みの実行