エンティティ抽出の基本的な使用例

この項の例では、非常に単純なエンティティ抽出の基本例について説明します。この例では、次のテキストを含むclobがあることを前提とします。

New York, United States of America
The Dow Jones Industrial Average climbed by 5% yesterday on news of a new software release from database giant Oracle Corporation.

この例では、CTX_ENTITY.EXTRACTを使用して、CLOB値のすべてのエンティティを検索します。(行の場合は、テキストがどのようにclobになるか、または出力clobをどのように提供するかを心配する必要はありません)

エンティティ抽出には、オプションの指定を可能にする新しいタイプのポリシーの抽出ポリシーが必要です。行に対して、次のようにデフォルトのポリシーを作成します。

ctx_entity.create_extract_policy( 'mypolicy' );

次に、extractをコールして作業を実行できます。これには、ポリシー名、処理するドキュメント、言語および出力clobの4つの引数が必要です(たとえば、dbms_lob.createtemporaryコールして初期化しておく必要があります)。

ctx_entity.extract( 'mypolicy', mydoc, 'ENGLISH', outclob )

outclobには、抽出されたエンティティを識別するXMLが含まれます。内容を表示すると(適切にフォーマットされるように、XMLTYPEとして選択することをお薦めします)、次のように表示されます。

<entities>
  <entity id="0" offset="0" length="8" source="SuppliedDictionary">
    <text>New York</text>
    <type>city</type>
  </entity>
  <entity id="1" offset="150" length="18" source="SuppliedRule">
    <text>Oracle Corporation</text>
    <type>company</type>
  </entity>
  <entity id="2" offset="10" length="24" source="SuppliedDictionary">
    <text>United States of America</text>
    <type>country</type>
  </entity>
  <entity id="3" offset="83" length="2" source="SuppliedRule">
    <text>5%</text>
    <type>percent</type>
  </entity>
  <entity id="4" offset="113" length="8" source="SuppliedDictionary">
    <text>software</text>
    <type>product</type>
  </entity>
  <entity id="5" offset="0" length="8" source="SuppliedDictionary">
    <text>New York</text>
    <type>state</type>
  </entity>
</entities>

これは、XML対応プログラムで処理する場合は問題ありません。ただし、よりSQLフレンドリなビューにする場合は、Oracle XML DB機能を使用して次のように変換できます。

select xtab.offset, xtab.text, xtab.type, xtab.source
from xmltable( '/entities/entity'
PASSING xmltype(outclob)
  COLUMNS 
    offset number       PATH '@offset',
    lngth number        PATH '@length',
    text   varchar2(50) PATH 'text/text()',
    type   varchar2(50) PATH 'type/text()',
    source varchar2(50) PATH '@source'
) as xtab order by offset;

これによって、次の出力が生成されます。

    OFFSET TEXT                      TYPE                 SOURCE
---------- ------------------------- -------------------- --------------------
         0 New York                  city                 SuppliedDictionary
         0 New York                  state                SuppliedDictionary
        10 United States of America  country              SuppliedDictionary
        83 5%                        percent              SuppliedRule
       113 software                  product              SuppliedDictionary
       150 Oracle Corporation        company              SuppliedRule

異なるタイプのエンティティがすべて必要ない場合は、フェッチするタイプを選択できます。これを実行するには、エンティティ・タイプのカンマ区切りのリストを使用して、4番目の引数をextractプロシージャに追加します。次に例を示します。

ctx_entity.extract( 'mypolicy', mydoc, 'ENGLISH', outclob, 'city, country' ) 
 
That would give us the XML
 
<entities>
  <entity id="0" offset="0" length="8" source="SuppliedDictionary">
    <text>New York</text>
    <type>city</type>
  </entity>
  <entity id="2" offset="10" length="24" source="SuppliedDictionary">
    <text>United States of America</text>
    <type>country</type>
  </entity>
</entities>