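Generate and Use Embeddings for an End-to-End Search
First, generate vector embeddings from text content by using a vector embedding model stored in the database, and then populate and query a vector index. At query time, the query criteria are also vectorized on the fly.
To run an end-to-end similarity search workflow by accessing a vector embedding model stored in the database: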
- Start SQL*Plus and connect to Oracle Database as a local user.
- Log in to SQL*Plus as the SYS user, connecting as SYSDBA.
conn sys/password as sysdba
CREATE TABLESPACE tbs1 DATAFILE 'tbs5.dbf' SIZE 20G AUTOEXTEND ON EXTENT MANAGEMENT LOCAL SEGMENT SPACE MANAGEMENT AUTO;
SET ECHO ON
SET FEEDBACK 1
SET NUMWIDTH 10
SET LINESIZE 80
SET TRIMSPOOL ON
SET TAB OFF
SET PAGESIZE 10000
SET LONG 10000
- Create a local user (docuser) and grant the necessary privileges.
drop user docuser cascade;
create user docuser identified by docuser DEFAULT TABLESPACE tbs1 quota unlimited on tbs1;
grant DB_DEVELOPER_ROLE to docuser;
- Create a local directory (VEC_DUMP) to store the input data and model files, and grant the necessary privileges.
create or replace directory VEC_DUMP as '/my_local_dir/';
grant read, write on directory VEC_DUMP to docuser;
commit;
- Connect as the local user (docuser):
conn docuser/password
- Create a relational table (documentation_tab) to store the text content.
drop table documentation_tab purge;
create table documentation_tab (id number, text clob);
insert into documentation_tab values (1, 'Analytics empowers business analysts and consumers with modern, AI-powered, self-service analytics capabilities for data preparation, visualization, enterprise reporting, augmented analysis, and natural language processing. Oracle Analytics Cloud is a scalable and secure public cloud service that provides capabilities to explore and perform collaborative analytics for you, your workgroup, and your enterprise. Oracle Analytics Cloud is available on Oracle Cloud Infrastructure Gen 2 in several regions in North America, EMEA, APAC, and LAD when you subscribe through Universal Credits. You can subscribe to Professional Edition or Enterprise Edition.');
insert into documentation_tab values (3, 'Generative AI Data Science is a fully managed and serverless platform for data science teams to build, train, and manage machine learning models in the Oracle Cloud Infrastructure.');
insert into documentation_tab values (4, 'Language allows you to perform sophisticated text analysis at scale. Using the pretrained and custom models, you can process unstructured text to extract insights without data science expertise. Pretrained models include sentiment analysis, key phrase extraction, text classification, and named entity recognition. You can also train custom models for named entity recognition and text classification with domain specific datasets. Additionally, you can translate text across numerous languages.');
insert into documentation_tab values (5, 'When you work with Oracle Cloud Infrastructure, one of the first steps is to set up a virtual cloud network (VCN) for your cloud resources. This topic gives you an overview of Oracle Cloud Infrastructure Networking components and typical scenarios for using a VCN. A virtual, private network that you set up in Oracle data centers. It closely resembles a traditional network, with firewall rules and specific types of communication gateways that you can choose to use. 
A VCN resides in a single Oracle Cloud Infrastructure region and covers one or more CIDR blocks (IPv4 and IPv6, if enabled). See Allowed VCN Size and Address Ranges. The terms virtual cloud network, VCN, and cloud network are used interchangeably in this documentation. For more information, see VCNs and Subnets.');
insert into documentation_tab values (6, 'NetSuite banking offers several processing options to accurately track your income. You can record deposits to your bank accounts to capture customer payments and other monies received in the course of doing business. For a deposit, you can select payments received for existing transactions, add funds not related to transaction payments, and record any cash received back from the bank.');
commit;
- Call the load_onnx_model procedure to load the embedding model.
EXECUTE dbms_vector.drop_onnx_model(model_name => 'doc_model', force => true);
EXECUTE dbms_vector.load_onnx_model( 'VEC_DUMP', 'my_embedding_model.onnx', 'doc_model', json('{"function" : "embedding", "embeddingOutput" : "embedding" , "input": {"input": ["DATA"]}}') );
In this example, the procedure loads an ONNX model file named my_embedding_model.onnx from the VEC_DUMP directory into the database as doc_model. Replace my_embedding_model.onnx with the ONNX export of your embedding model, and doc_model with the name under which the imported model is to be stored in the database.
Note: If you do not have an embedding model in ONNX format, follow the steps listed in "ONNX Pipeline Models: Text Embedding".
- Create a relational table (doc_chunks) to store the chunks of unstructured data and the associated vector embeddings generated by using doc_model.
create table doc_chunks as (
  SELECT d.id id,
         row_number() over (partition by d.id order by d.id) chunk_id,
         vc.chunk_offset chunk_offset,
         vc.chunk_length chunk_length,
         vc.chunk_text chunk,
         vector_embedding(doc_model using vc.chunk_text as data) vector
  FROM documentation_tab d,
       vector_chunks(d.text by words max 100 overlap 10 split RECURSIVELY) vc
);
The CREATE TABLE statement reads the text from the DOCUMENTATION_TAB table. It first applies the VECTOR_CHUNKS SQL function to split the text into chunks based on the specified chunking parameters, and then applies the VECTOR_EMBEDDING SQL function to generate the corresponding vector embedding for each resulting chunk text.
- Select rows from the doc_chunks table to view the chunked output.
desc doc_chunks;
set linesize 100
set long 1000
col id for 999
col chunk_id for 99999
col chunk_offset for 99999
col chunk_length for 99999
col chunk for a30
col vector for a100
select id, chunk_id, chunk_offset, chunk_length, chunk from doc_chunks;
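The chunking output returns a set of seven chunks, split recursively (that is, using the BLANKLINE, NEWLINE, SPACE, NONE sequence). Document 5 produces two chunks because the maximum word limit of 100 is reached. The first chunk ends at a blank line. The text of the first chunk overlaps the second chunk by 10 words, counting commas and the period: "for you, your workgroup, and your enterprise." Similarly, the fifth and sixth chunks overlap by 10 words: "centers. It closely resembles a traditional network, with".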
  ID CHUNK_ID CHUNK_OFFSET CHUNK_LENGTH CHUNK
---- -------- ------------ ------------ ------------------------------
   1        1            1          418 Analytics empowers business analysts and consumers with modern, AI-powered, self-service analytics capabilities for data preparation, visualization, enterprise reporting, augmented analysis, and natural language processing. Oracle Analytics Cloud is a scalable and secure public cloud service that provides capabilities to explore and perform collaborative analytics for you, your workgroup, and your enterprise.
   1        2          373          291 for you, your workgroup, and your enterprise. Oracle Analytics Cloud is available on Oracle Cloud Infrastructure Gen 2 in several regions in North America, EMEA, APAC, and LAD when you subscribe through Universal Credits. You can subscribe to Professional Edition or Enterprise Edition.
   3        1            1          180 Generative AI Data Science is a fully managed and serverless platform for data science teams to build, train, and manage machine learning models in the Oracle Cloud Infrastructure.
   4        1            1          505 Language allows you to perform sophisticated text analysis at scale. Using the pretrained and custom models, you can process unstructured text to extract insights without data science expertise. Pretrained models include sentiment analysis, key phrase extraction, text classification, and named entity recognition. You can also train custom models for named entity recognition and text classification with domain specific datasets. Additionally, you can translate text across numerous languages.
   5        1            1          386 When you work with Oracle Cloud Infrastructure, one of the first steps is to set up a virtual cloud network (VCN) for your cloud resources. This topic gives you an overview of Oracle Cloud Infrastructure Networking components and typical scenarios for using a VCN. A virtual, private network that you set up in Oracle data centers. It closely resembles a traditional network, with
   5        2          329          474 centers. It closely resembles a traditional network, with firewall rules and specific types of communication gateways that you can choose to use. A VCN resides in a single Oracle Cloud Infrastructure region and covers one or more CIDR blocks (IPv4 and IPv6, if enabled). See Allowed VCN Size and Address Ranges. The terms virtual cloud network, VCN, and cloud network are used interchangeably in this documentation. For more information, see VCNs and Subnets.
   6        1            1          393 NetSuite banking offers several processing options to accurately track your income. You can record deposits to your bank accounts to capture customer payments and other monies received in the course of doing business. For a deposit, you can select payments received for existing transactions, add funds not related to transaction payments, and record any cash received back from the bank.

7 rows selected.
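The effect of a maximum chunk size combined with an overlap can be illustrated outside the database. The following Python sketch is a simplified illustration and not Oracle's VECTOR_CHUNKS implementation: it ignores the recursive BLANKLINE, NEWLINE, SPACE, NONE splitting, and the helper names tokenize and chunk_tokens, as well as the rule that each punctuation mark counts as one token, are assumptions made for this example.

```python
import re

def tokenize(text):
    # Assumption for this sketch: every run of word characters and every
    # single punctuation mark counts as one token.
    return re.findall(r"\w+|[^\w\s]", text)

def chunk_tokens(tokens, max_tokens=100, overlap=10):
    """Split tokens into windows of at most max_tokens, where each new
    window starts with the last `overlap` tokens of the previous one."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be non-negative and smaller than max_tokens")
    chunks = []
    start = 0
    while start < len(tokens):
        end = min(start + max_tokens, len(tokens))
        chunks.append(tokens[start:end])
        if end == len(tokens):
            break
        # Step back so the next chunk repeats the tail of this one.
        start = end - overlap
    return chunks

# Synthetic input: 150 placeholder tokens w0 .. w149.
sample = [f"w{i}" for i in range(150)]
chunks = chunk_tokens(sample)
print(len(chunks))                        # number of chunks produced
print(chunks[0][-10:] == chunks[1][:10])  # tail of chunk 1 equals head of chunk 2
```

Under the tokenization assumed here, the overlapping text "for you, your workgroup, and your enterprise." from the output above counts as 10 tokens, which matches the overlap 10 setting.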
- To view the embedding output, select rows from the doc_chunks table and check the first vector result.
select vector from doc_chunks where rownum <= 1;
An excerpt of the output follows:
[1.18813422E-002,2.53968383E-003,-5.33896387E-002,1.46877998E-003,5.77209815E-002,-1.58939194E-002,3
.12595293E-002,-1.13087103E-001,8.5138239E-002,1.10731693E-002,3.70671228E-002,4.03710492E-002,1.503
95066E-001,3.31836529E-002,-1.98343433E-002,6.16453104E-002,4.2827677E-002,-4.02921103E-002,-7.84291
551E-002,-4.79201972E-002,-5.06678E-002,-1.36317732E-002,-3.7761624E-003,-2.3332756E-002,1.42400926E
-002,-1.11553416E-001,-3.70503664E-002,-2.60722954E-002,-1.2074843E-002,-3.55089158E-002,-1.03518805
E-002,-7.05051869E-002,5.63110895E-002,4.79055084E-002,-1.46315445E-003,8.83129537E-002,5.12795225E-
002,7.5858552E-003,-4.13030013E-002,-5.2099824E-002,5.75958602E-002,3.72097567E-002,6.11167103E-002,
,-1.23207876E-003,-5.46219759E-003,3.04734893E-002,1.80617068E-002,-2.85708476E-002,-1.01670986E-002
,6.49402961E-002,-9.76506807E-003,6.15146831E-002,5.27246818E-002,7.44994432E-002,-5.86469211E-002,8
.84285953E-004,2.77456306E-002,1.99283361E-002,2.37570312E-002,2.33389344E-002,-4.07911092E-002,-7.6
1070028E-002,1.23929314E-001,6.65794984E-002,-6.15389943E-002,2.62510721E-002,-2.48490628E-002]
- Create an index on the vector column of the doc_chunks table.
create vector index vidx on doc_chunks (vector)
  organization neighbor partitions
  with target accuracy 95
  distance EUCLIDEAN
  parameters (type IVF, neighbor partitions 2);
- Run queries using the vector index.
- Query about machine learning:
select id, vector_distance( vector, vector_embedding(doc_model using 'machine learning models' as data), EUCLIDEAN) results FROM doc_chunks order by results;
  ID    RESULTS
---- ----------
   3 1.074E+000
   4 1.086E+000
   5 1.212E+000
   5 1.296E+000
   1 1.304E+000
   6 1.309E+000
   1 1.365E+000

7 rows selected.
- Query about generative AI:
select id, vector_distance( vector, vector_embedding(doc_model using 'gen ai' as data), EUCLIDEAN) results FROM doc_chunks order by results;
  ID    RESULTS
---- ----------
   4 1.271E+000
   3 1.297E+000
   1 1.309E+000
   5  1.32E+000
   1 1.352E+000
   5 1.388E+000
   6 1.424E+000

7 rows selected.
- Query about networks:
select id, vector_distance( vector, vector_embedding(doc_model using 'computing networks' as data), MANHATTAN) results FROM doc_chunks order by results;
  ID    RESULTS
---- ----------
   5 1.387E+001
   5 1.441E+001
   3 1.636E+001
   1 1.707E+001
   4 1.758E+001
   1 1.795E+001
   6 1.902E+001

7 rows selected.
- Query about banking:
select id, vector_distance( vector, vector_embedding(doc_model using 'banking, money' as data), MANHATTAN) results FROM doc_chunks order by results;
  ID    RESULTS
---- ----------
   6 1.363E+001
   1 1.969E+001
   5 1.978E+001
   5 1.997E+001
   3 1.999E+001
   1 2.058E+001
   4 2.079E+001

7 rows selected.
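The queries above rank chunks by their distance to the query embedding, using either the EUCLIDEAN or the MANHATTAN metric. As a rough illustration of what these two metrics compute and how ORDER BY turns distances into a ranking, here is a small Python sketch. The three-dimensional vectors and the rank helper are made up for this example and are not real embeddings from doc_model.

```python
import math

def euclidean(a, b):
    # L2 distance: square root of the sum of squared differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # L1 distance: sum of absolute differences.
    return sum(abs(x - y) for x, y in zip(a, b))

def rank(query, rows, metric):
    # Pair each row id with its distance to the query, smallest first,
    # which is the effect of ordering by the distance column.
    scored = [(row_id, metric(vec, query)) for row_id, vec in rows]
    return sorted(scored, key=lambda pair: pair[1])

# Hypothetical (id, vector) rows standing in for doc_chunks.
rows = [(3, (1.0, 0.0, 0.0)), (5, (0.0, 1.0, 0.0)), (6, (0.0, 0.0, 1.0))]
query = (0.9, 0.1, 0.0)

for row_id, dist in rank(query, rows, euclidean):
    print("EUCLIDEAN", row_id, round(dist, 3))
for row_id, dist in rank(query, rows, manhattan):
    print("MANHATTAN", row_id, round(dist, 3))
```

For any pair of vectors the Manhattan distance is at least as large as the Euclidean distance, which is consistent with the larger RESULTS values in the MANHATTAN queries above. Distances produced by different metrics are therefore not directly comparable; only the ordering within one query is meaningful.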