Setting Up Database Sources

With a database source, you can crawl any JDBC-enabled database. A database source can crawl database content projected as a view or query. Each record in the view or query result set is interpreted as a document. You can create public database sources or secure database sources.

See Also:

"Choosing Between Table and Database Sources" for a comparison of the benefits and limitations of database sources and table sources.

Required Columns in Database Sources

The view or query to be crawled must contain the columns described in Table 9-1. All column names must be in upper case.

Table 9-1 Database Source Required Columns

Column Type Description

CONTENT

VARCHAR2 or CLOB

Document content.

KEY

VARCHAR2

Key to identify the record in the record set. You can use a custom name for this column by modifying drivers.properties. See "Configuring the JDBC Driver".

LANG

VARCHAR2

Document language in ISO 639-1 language code; for example, en for English or ja for Japanese.

LASTMODIFIEDDATE

DATE

Last modified date of the document.

If you do not have a column for the mandatory LastModifiedDate attribute, use a constant date value in the SQL query for the source. Use a format that the getTimestamp method of the corresponding JDBC driver accepts without errors. Incremental changes to records are not picked up by re-crawls, so always schedule a full crawl.

URL

VARCHAR2

Display URL for the document. The value for this column cannot be null. This connector requires that there is URL-based access to the records in the result set of the view or query.


Optional Columns in Database Sources

The view or query can contain the optional columns describe in Table 9-2. Any other column is considered an attribute of the document.

If the query or view contains both content and either an attachment or attachment link, then one column (in the following order) is considered document content:

  1. ATTACHMENT_LINK

  2. ATTACHMENT

  3. CONTENT

Even if the ATTACHMENT_LINK or ATTACHMENT column is specified in the query, you should include the mandatory CONTENT column. However, the content of ATTACHMENT_LINK or ATTACHMENT is indexed as document content.

Table 9-2 Database Source Optional Columns

Column Type Description

ATTACHMENT

BLOB

Binary attachments for the document.

ATTACHMENT_LINK

VARCHAR2

A link to the attachment for the document. HTTP, HTTPS, FILE, and FTP are valid.)

CONTENTTYPE

VARCHAR2

Content type of the document; for example, "text/html" for HTML documents, "application/pdf" for PDF documents, or "application/msword" for Microsoft Word documents.

Leave blank when the content type is unknown or varied so that is it not feasible to specify the content type for each document individually.

PATH

VARCHAR2

Path to the document. This is used in the browse feature. This can be used to represent the organizational hierarchy of the document. For example, level1#level2#level3.

TITLE

VARCHAR2

Title of the document to be displayed in the Oracle SES search result page.

LMD_TIMEZONE

VARCHAR2

Specifies the time zone for the date specified in LASTMODIFIEDDATE. For example CST. Oracle SES converts the last modified date from the specified time zone to Oracle SES time zone. If the time zone is not specified, then the date is considered to be in the Oracle SES time zone. See "Searching on Date Attributes" for more details about the default Oracle SES time zone.


Configuring the JDBC Driver

Depending on your database source, you may need to configure the JDBC driver.

To crawl any third-party database:  

  1. Download the appropriate JDBC driver jar into the ORACLE_HOME/search/lib/plugins/oracleapplications directory in Oracle SES. If the JDBC drivers for JRE 1.5 and JRE 1.6 are different, then perform the following:

    1. Download both the driver jars into ORACLE_HOME/search/lib/plugins/oracleapplications directory in Oracle SES.

    2. Add an entry for the JRE 1.6 version of the driver jar to the CLASSPATH element of ORACLE_HOME/search/config/searchctl.conf.

    3. Restart the middle tier.

  2. Update the drivers.properties file with the following information: DatabaseName:DriverClassName.

  3. Add the JRE 1.5 JDBC driver jar file name to the classpath in MANIFEST.MF of appsjdbc.jar and DBCrawler.jar.

For a key attribute that is not named KEY: 

  • Change the JDBC driver information in the drivers.properties file to specify the key attribute name:

    database_name: driver_class_name, key_attribute_name

For example, for a key attribute named ID:

oracle : oracle.jdbc.driver.OracleDriver, ID

In the crawling query, use key_attribute_name as the alias for the key value column name. In this example, ID is the alias for KEYVAL:

SELECT keyval id, content, url, lastmodifieddate, lang FROM sales_only

For Oracle and SQL Server databases, the following default drivers are used if none is specified in drivers.properties:

  • Oracle: oracle.jdbc.driver.OracleDriver

  • SQL Server: com.microsoft.sqlserver.jdbc.SQLServerDriver

Query File XML Schema Definition

The following is the XSD that defines the format of the XML query file.

<!--[if !supportEmptyParas]-->XSD for the XML sub-queries file:<!--[endif]-->
<?xml version="1.0" encoding="windows-1252" ?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"  xmlns="http://xmlns.oracle.com/ses/sqlconnector/detail-attribute-queries" targetNamespace="http://xmlns.oracle.com/ses/sqlconnector/detail-attribute-queries"  elementFormDefault="qualified">
  <xsd:complexType name="sqlQueriesType">
    <xsd:annotation>
      <xsd:documentation>
        Specify detail and attribute queries as a source parameter for 
        each document fetched by the parent query.
      </xsd:documentation>
    </xsd:annotation>
    <xsd:sequence>
      <xsd:element name="attachmentQueries" maxOccurs="1" minOccurs="0">
        <xsd:annotation>
          <xsd:documentation>
            Specify detail queries to fetch detail records for each document
            represented by the parent record. The parent records, fetched by 
            the parent query, are specified as a source parameter. Each record 
            in the document (parent) query can be associated with several detail
            (child) records. Each of these child records has a single column
           specifying the content that will be indexed as attachment to the
           parent document. The child query should select a single column, and
           the WHERE clause should have bind variables of the form 
           ##PARENT ATTR##, where the value of PARENT ATTR from the parent
           record is substituted while executing the detail query. 
         </xsd:documentation> 
        </xsd:annotation>
        <xsd:complexType>
          <xsd:sequence>
            <xsd:element name="query" maxOccurs="unbounded" minOccurs="1">
             <xsd:complexType>
                <!--Attribute to specify whether the contents retrieved by the
                  query is inline attachment or link to an attachment. The value
                  "true" specifies that the content is a link to an attachment
                  and "false" indicates inline attachment. Default value is
                  false.-->                                    
                <xsd:attribute name="link" default="false"/>
                <!--Content type of the attachment. If no value is specified, 
                  SES will auto-detect the content type.-->
                <xsd:attribute name="contenttype" default="null"/>
              </xsd:complexType>
            </xsd:element>
          </xsd:sequence>
        </xsd:complexType>
      </xsd:element>
      <xsd:element name="attributeQueries" maxOccurs="1" minOccurs="0">
        <xsd:annotation>
          <xsd:documentation>
            Specify queries to retrieve values of attributes of the parent
            document. Use this feature if the attribute can contain multiple
            values for a document. If the attribute is a single-valued 
            attribute, then it can be specified in the parent query. The WHERE
            clause should have bind variables of the form ##PARENT ATTR##,
            where the value of PARENT ATTR from the parent record is substituted
            while executing the query.
          </xsd:documentation>
        </xsd:annotation>
        <xsd:complexType>
          <xsd:sequence>
            <xsd:element name="query" maxOccurs="unbounded" minOccurs="1"/>
          </xsd:sequence>
        </xsd:complexType>
      </xsd:element>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:element name="sqlQueries" type="sqlQueriesType"/>
</xsd:schema>

Creating Public Database Sources

Public database sources have no security implemented in Oracle SES.

To create a public database source: 

  1. Create a database source on the Home - Sources page. Select Database from the Source Type list, and click Create.

  2. Enter the database source parameters as described in Table 9-3.

  3. Click Next.

  4. Set authorization to No Access Control List, and clear the authorization manager class name and jar name.

  5. Click Create to create the database source.

Table 9-3 Database Source Parameters

Parameter Value

Database Connection String

JDBC connection string for the database with content to be crawled. The JDBC string is driver-specific. For example, jdbc:oracle:thin:@server:port:SID

User ID

User ID to log in to the database specified in Database Connection String. This user ID must have access to the schema owning the view specified in View or the query specified in Query.

Password

Password to log in to the database specified in Database Connection String.

View

Table or view to be crawled. Specify either View or Query, not both.

Document Count

Maximum number of documents to be crawled before indexing. Enter -1 to crawl all documents before indexing.

Query File

Path to the XML file specifying the subqueries to crawl attachments and attributes of documents corresponding to every record in the main query. See "Query File XML Schema Definition".

Query

Query projecting the content to be crawled. Specify either View or Query, not both.

URL Prefix

String that precedes the content of the URL column and forms a display URL for the document.

Cache File

Prefix of a local file name in which the contents can be temporarily cached while crawling.

Path Separator

The character separating the tokens in the PATH of the document as returned by the query or view. It must be a single character, and it cannot be a space, a single or double quote, or a control character.

Parse Attributes

Enter true to extract the values of the attributes from the document content specified in the SOLUTION or CONTENT column. Enter false otherwise, or when the content is type text/html.

In this example, attr1 and attr2 are extracted as attributes of the document with values 22 and 333 respectively:

<attr1>22</attr1> <attr2>333</attr2>

Content up to the first attribute is interpreted as the document content. The remaining portion is used to extract attributes only. In this example, only "page" is considered document content:

page<attr1>22</attr1> is <attr2>333</attr2> dispersed

Remove Deleted Documents

Enter true to remove deleted documents from the index; otherwise, enter false.

Attachment Link Authentication Type

Standard Java authentication type used by the application serving the link in the ATTACHMENT_LINK column. Enter one of these values:

  • PUBLIC: No authentication.

  • DIGEST: Digest authentication

  • BASIC: Basic authentication

  • NATIVE: Native authentication in the source

Attachment Link User ID

User ID for accessing the links specified in the ATTACHMENT_LINK column. Required when the link targets are secure.

Attachment Link Password

Password for Attachment Link User ID.

Attachment Link Realm

Realm of the application serving the link in the ATTACHMENT_LINK column. Required when the link targets are secure.

Grant Security Attributes

Leave blank for public sources.

Deny Security Attributes

Leave blank for public sources.


Defining User-Defined Security for Database Sources

Some attributes in the view or query being crawled must be identified as security attributes. The values of these attributes determine if a user is authorized to view a document. These attributes can be either GRANT attributes or DENY attributes.

See Also:

"Authorization Plug-in API" for more information about these attribute types and the user-defined security model

To create a database source with user-defined security: 

  1. On the Home - Sources page, select Database from the Source Type list and click Create.

  2. Enter values for the parameters as described Table 9-3. Specify the GRANT and DENY attributes as values for parameters Grant Security Attributes and Deny Security Attributes respectively. If there are multiple GRANT or DENY security attributes, then separate attribute names with a space.

  3. Click Next.

  4. Enter values for the authorization plug-in parameters:

    • Authorization Database Connection String: JDBC connection string for the authorization database. The values of the security attributes to which a given user is authorized are retrieved from this database. The JDBC string is driver-specific.

    • User ID: User ID to login to the authorization database.

    • Password: Password to login to the authorization database.

    • Authorization Query: SQL query to retrieve the values of security attributes to which a given user is authorized. The SELECT clause of this query should have all the security attributes specified in Step 2 with identical names. This query can be of two types:

      • The query can return a single record for a given user. The value in each security attribute column should be a space-delimited list of values to which the user is authorized.

      • The query can return multiple records for a given user. The value in each security attribute column of every row of the result set of this query is interpreted as a single value.

        Specify a question mark (?) as the placeholder for the username in the query.

    • Single Record Query: Enter true if the authorization query returns a single record for a given user.

    • Authorization User ID Format: Format of the user ID to be used in the SQL query specified in Authorization Query. This format should be an authentication attribute of the active identity plug-in.

      For example, if Oracle SES is configured with the Oracle Internet Directory identity plug-in (which supports DN, nickname and e-mail address as authentication attributes), then this parameter can be specified as nickname. The nickname of the current user is then used in the SQL authorization query to build the security filter.

      If no value is specified for this parameter, then the user ID in the canonical form of the active identity plug-in is used in the authorization query to build the security filter.

  5. Click Create to create the database source.

Database Search Attributes

Database sources have no predefined attributes. The crawler collects attributes from columns defined during source creation. You must map the columns to the search attributes.

Example of Creating a Database Source With User-Defined Security

The document set to be crawled is in tables T1 and T2 as specified by the following query:

SELECT 
     T1.ID, 
     T1.DESCRIPTION, 
     T2.NAME, 
     T1.LAST_UPDATE_DATE, 
     T2.AUTH_ID, T1.HIERARCHY
FROM 
     T1, T2 
WHERE 
     T1.ID = T2.DOC_ID 

The document content is provided by the T1.DESCRIPTION column.

Each document has an HTTP access URL of the form http://my.company.com/docserver?doc_id=document_identifier.

Access to a document is controlled by the value of T2.AUTH_ID. A document is accessible to a user SCOTT only if the value of T2.AUTH_ID for the document is in the list of AUTH_IDs for the user as retrieved by the following query:

SELECT AUTH_ID FROM USER_AUTH A 
     WHERE A.USER='SCOTT' 

This source can be crawled as a database source type with the following source parameter values:

  • Database Connection String: jdbc:oracle:thin:@example:7777:ses

  • User ID: apps_user

  • Password: ******

  • View:

  • Document Count: -1

  • Query:

    SELECT 
         'docserver?doc_id='|| T1.ID URL, 
         T1.ID "KEY", 
         'en' LANG, 
         T1.LAST_UPDATE_DATE LASTMODIFIEDDATE, 
         T1.DESCRIPTION CONTENT, 
         'text/plain' CONTENTTYPE, 
         T2.NAME CUSTOMER_NAME, 
         T2.AUTH_ID, 
         T1.HIERARCHY PATH 
    FROM 
         T1, T2 
    WHERE 
         T1.ID=T2.DOC_ID
    
  • Query File:

    <?xml version="1.0" encoding="UTF-8" ?>
    <sqlQueries xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://xmlns.oracle.com/ses/sqlconnector/detail-attribute-queries detail-attribute-queries.xsd"
     xmlns="http://xmlns.oracle.com/ses/sqlconnector/detail-attribute-queries">
       <attachmentQueries>
         <query>
           <![CDATA[SELECT COMMENTS FROM COMMENTS_TBL WHERE DOCID=##KEY##]]>
         </query>
         <query link="true" contenttype='text/html'>
                <![CDATA[SELECT NOTES FROM NOTES_LINK_TBL WHERE DOCID=##KEY##]]>
         </query>
      </attachmentQueries>
      <attributeQueries>
           <query>
             <![CDATA[SELECT AUTHOR FROM AUTHORS_TBL WHERE DOCID=##KEY##]]>
           </query>
           <query>
                    <![CDATA[SELECT KEYWORD FROM KEYWORD_TBL WHERE DOCID=##KEY##]]>
           </query>
      </attributeQueries>
    </sqlQueries>
     
    

    AUTHID and KEY are columns in the select list of the parent query.

    Note: This must be the path to a local file containing the subqueries for attributes and attachments that are currently listed directly for query file.

  • URL Prefix: http://my.example.com/

  • Cache File: /tmp/cacheFile

  • Path Separator: #

  • Parse Attributes: false

  • Grant Security Attributes: AUTH_ID

  • Deny Security Attributes:

The following are sample parameter values for authorization.

  • Database Connection String: jdbc:oracle:thin:@example:7777:ses

  • User ID: apps_user

  • Password: *****

  • Authorization Query:

    SELECT AUTH_ID 
    FROM USER_AUTH A 
    WHERE A.USER=UPPER(?)
    
  • Single Record Query: false

  • Authorization User ID Format: username