5.8 Indexing External Documents

You can use semantic indexing on documents that are stored in a file system or on the network. In such cases, you store the references to external documents in a table column, and you create a semantic index on the column using an appropriate extractor policy.

To index external documents, define an extractor policy with appropriate preferences, using an XML document that is assigned to the preferences parameter of the SEM_RDFCTX.CREATE_POLICY procedure, as in the following example:

begin
  sem_rdfctx.create_policy (
       policy_name => 'SEM_EXTR_FROM_FILE',
       extractor   => mdsys.gatenlp_extractor()),
       preferences => sys.xmltype('<RDFCTXPreferences>
                                     <Datastore type="FILE"> 
                                        <Path>EXTFILES_DIR</Path>
                                     </Datastore>
                                   </RDFCTXPreferences>')); 
end;
/

The <Datastore> element in the preferences document specifies the type of repository used for the documents to be indexed. When the value for the type attribute is set to FILE, the <Path> element identifies a directory object in the database (created using the SQL statement CREATE DIRECTORY). A table column indexed using the specified extractor policy is expected to contain relative paths to individual files within the directory object, as shown in the following example:

CREATE TABLE newsfeed (docid       number, 
                       articleLoc  VARCHAR2(100)); 
INSERT INTO into newsfeed (docid, articleLoc) values
                     (1, 'article1.txt'); 
INSERT INTO newsfeed (docid, articleLoc) values
                     (2, 'folder/article2.txt'); 
 
CREATE INDEX ArticleIndex on newsfeed (articleLoc)
   INDEXTYPE IS mdsys.SemContext PARAMETERS ('SEM_EXTR_FROM_FILE');

To index documents that are accessed using HTTP protocol, create a extractor policy with preferences that set the type attribute of the <Datastore> element to URL and that list one or more hosts in the <Path> elements, as shown in the following excerpt:

<RDFCTXPreferences>
   <Datastore type="URL"> 
       <Path>http://cnn.com</Path>
       <Path>http://abc.com</Path>
   </Datastore>
</RDFCTXPreferences>

The schema in which a semantic index for external documents is created must have the necessary privileges to access the external objects, including access to any proxy server used to access documents outside the firewall, as shown in the following example:

-- Grant read access to the directory object for FILE data store -- 
grant read on directory EXTFILES_DIR to SEMUSR;
 
-- Grant connect access to set of hosts for URL data store -- 
begin
  dbms_network_acl_admin.create_acl (
                acl          => 'network_docs.xml',
                description  => 'Normal Access',
                principal    => 'SEMUSR',
                is_grant     => TRUE,
                privilege    => 'connect');
end;
/
 
begin
  dbms_network_acl_admin.assign_acl (
               acl        => 'network_docs.xml',
               host       =>  'cnn.com',
               lower_port => 1,
               upper_port => 10000);
end;
/

External documents that are semantically indexed in the database may be in one of the well-known formats such as Microsoft Word, RTF, and PDF. This takes advantage of the Oracle Text capability to extract plain text version from formatted documents using filters (see the CTX_DOC.POLICY_FILTER procedure, described in Oracle Text Reference). To semantically index formatted documents, you must specify the name of a CTX policy in the extractor preferences, as shown in the following excerpt:

<RDFCTXPreferences>
   <Datastore type="FILE" filter="CTX_FILTER_POLICY"> 
       <Path>EXTFILES_DIR</Path>
   </Datastore>
</RDFCTXPreferences>

In the preceding example, the CTX_FILTER_POLICY policy, created using the CTX_DDL.CREATE_POLICY procedure, must exist in your schema. The table columns that are semantically indexed using this preferences document can store paths to formatted documents, from which plain text is extracted using the specified CTX policy. The information extractor associated with the extractor policy then processes the plain text further, to extract the semantics in RDF/XML format.